Oral

Microarrays and omics data 2

Presenter: Ruth Heller

When: Thursday, July 14, 2016      Time: 11:00 AM - 12:30 PM

Room: Salon B Carson Hall (Level 2)

Session Synopsis:

Powerful and omnibus consistent distribution-free K-sample and independence tests

A popular approach for testing whether two univariate random variables are statistically independent consists of partitioning the sample space into bins and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While coarse partitions may be best for detecting simple relationships, a great gain in power can be achieved for complex relationships by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove that these are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the regularized tests have excellent power compared to existing tests, and are almost as powerful as the tests based on the optimal (yet unknown in practice) partition size. We demonstrate the usefulness of our approach on two real data examples. Our first example examines the co-dependence between pairs of genes in a yeast gene expression dataset with 300 samples, where we discover many nonlinear associations that were not discovered by other tests of independence. Our second example examines the association between 10000 SNPs in pairs of psychiatric disorders. Joint work with Yair Heller, Shachar Kaufman, Barak Brill, and Malka Gorfine.
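
As a rough illustration of aggregating scores over partitions, the sketch below sums Pearson chi-square scores over all 2x2 partitions whose split point sits at an observed sample point, and calibrates the sum with a permutation test. The function names `sum_chi2_2x2` and `permutation_pvalue`, the restriction to 2x2 partitions, and the choice of the Pearson score are assumptions made for illustration. This is not the authors' implementation, and it does not include their polynomial-time algorithms for finer partitions or their regularized tests.

```python
import numpy as np


def sum_chi2_2x2(x, y):
    """Sum Pearson chi-square scores over 2x2 partitions split at each sample point."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    total = 0.0
    for i in range(n):
        left = x <= x[i]   # split the x-axis at the i-th observation
        low = y <= y[i]    # split the y-axis at the i-th observation
        observed = np.array(
            [
                [np.sum(left & low), np.sum(left & ~low)],
                [np.sum(~left & low), np.sum(~left & ~low)],
            ],
            dtype=float,
        )
        row = observed.sum(axis=1, keepdims=True)
        col = observed.sum(axis=0, keepdims=True)
        expected = row @ col / n
        mask = expected > 0  # skip empty margins (e.g. split at the maximum)
        total += np.sum((observed[mask] - expected[mask]) ** 2 / expected[mask])
    return total


def permutation_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value for the summed statistic under independence."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    observed = sum_chi2_2x2(x, y)
    null = np.array(
        [sum_chi2_2x2(x, rng.permutation(y)) for _ in range(n_perm)]
    )
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

Because the statistic uses only order comparisons between observations, it is unchanged by strictly increasing transformations of either variable, which is the sense in which such a test is distribution-free.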

Microarrays and omics data 2

Presenter: Gilles Monneret

When: Thursday, July 14, 2016      Time: 11:00 AM - 12:30 PM

Room: Salon B Carson Hall (Level 2)

Session Synopsis:

Identification of marginal causal relationships in gene networks from observational and interventional expression data.

Gene network inference from transcriptomic data is a recent and major methodological challenge, usually based on partial correlations within a Gaussian graphical model framework. Recent methodological advances that fully exploit both observational and interventional (i.e., knock-out or knock-down) data go one step further by enabling the inference of causal networks. All these approaches suffer, however, from the same limitation: the number of parameters to be estimated is much larger than the very small number of biological replicates. As such, a significant bottleneck is the selection of a targeted and meaningful sub-group of genes of interest prior to causal network inference. In practice, this is often done either by using biological knowledge of a gene’s function or pathway, or by filtering genes using a statistical test (e.g., differential analysis). In this work, we propose a novel, objective, and targeted approach to identify candidate causal downstream relationships in gene expression experiments consisting of observational and partially available knock-out data (e.g., a knock-out of a single gene G). Our approach proceeds in two steps. First, a correlation test is performed to identify genes that potentially interact with G; note that this initial group of genes may include those that are causally upstream or downstream of G, as well as those with spurious correlations. Second, in order to deconvolute these possible relationships, we define a novel causal test to identify the partial order for each of the interaction pairs (G with each of the genes identified in the previous step). The proposed procedure is very fast and can be applied to thousands of genes simultaneously, which allows the pre-selection of a group of genes of interest for downstream causal network inference around gene G. An R package is under development to make the proposed causal selection approach easily applicable.
We illustrate with simulations and real data that the set of causally downstream genes identified is more meaningful than the subset identified by correlations or differential analyses alone.
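
The two-step screening idea can be sketched as follows. The abstract does not specify the causal test, so the second step here is a stand-in chosen for illustration: a Welch t-test comparing each candidate's expression under the knock-out of G with its observational expression, on the reasoning that an intervention on G should shift genes downstream of G but not genes upstream of it. The function name `screen_downstream`, the Bonferroni correction, and the threshold `alpha` are likewise assumptions, not part of the proposed method.

```python
import numpy as np
from scipy import stats


def screen_downstream(obs, ko, g, alpha=0.01):
    """Two-step screen around gene index g.

    obs: (n_obs, p) observational expression matrix.
    ko:  (n_ko, p) expression matrix measured under a knock-out of gene g.
    Returns (candidates, downstream) as lists of gene indices.
    """
    p = obs.shape[1]
    others = [j for j in range(p) if j != g]

    # Step 1: correlation test on observational data (Bonferroni-corrected).
    candidates = []
    for j in others:
        _, pval = stats.pearsonr(obs[:, g], obs[:, j])
        if pval < alpha / len(others):
            candidates.append(j)

    # Step 2 (stand-in for the causal test): does knocking out g shift gene j?
    downstream = []
    for j in candidates:
        _, pval = stats.ttest_ind(obs[:, j], ko[:, j], equal_var=False)
        if pval < alpha / len(candidates):
            downstream.append(j)

    return candidates, downstream
```

In a simulated chain U -> G -> D, both U and D would be expected to pass the correlation step, while only D would be expected to shift under the knock-out of G.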

Microarrays and omics data 2

Presenter: Trishanta Padayachee

When: Thursday, July 14, 2016      Time: 11:00 AM - 12:30 PM

Room: Salon B Carson Hall (Level 2)

Session Synopsis:

General linear models for investigating the dependence between gene module co-expression dynamics and a continuous trait

The current widespread availability of multiple omics datasets is particularly beneficial for uncovering the complex interplay between genes and their cellular environments. One way to enhance the understanding of the development and progression of complex diseases is to investigate the regulatory mechanisms behind gene co-expression (correlation of genes in a gene module). Often, changes in gene co-expression are investigated across two or more biological conditions defined by discretizing a continuous trait (for instance, metabolite concentration). However, for some traits it may be difficult to discretize the data. Moreover, the selection of arbitrary cut-off points is likely to influence the results of an analysis. To address these issues, we employ general linear models (GLMs) for correlated data to study the relationships between metabolite concentrations and gene module co-expression. The GLMs specify the between-gene correlations (co-expression) as a function of the continuous trait (e.g., metabolite concentration). The use of the GLMs allows for the investigation of different patterns (linear, non-linear, etc.) of co-expression. Furthermore, the modelling approach offers a formal framework for testing the significance of the observed pattern of co-expression. In our paper, the versatility of the approach is illustrated by using a real-life example and a simulation study. Additionally, we discuss the theoretical issues related to the construction of the test statistics and the computational challenges related to fitting the GLMs.
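
To make the idea of co-expression varying with a continuous trait concrete, the sketch below uses a crude moment-based stand-in rather than the likelihood-based GLMs for correlated data described in the abstract. For two standardized genes, the expected product of their expression values equals their correlation, so regressing that product on the trait gives a rough linear trend in co-expression. The function name `coexpression_trend` is hypothetical, and the sketch assumes the marginal means and variances of both genes do not change with the trait.

```python
import numpy as np
from scipy import stats


def coexpression_trend(x, y, trait):
    """Regress the product of two standardized genes on a continuous trait.

    Returns (intercept, slope, p-value for the slope). A non-zero slope
    suggests that the correlation between x and y changes with the trait.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    product = zx * zy  # E[product | trait] approximates the correlation
    fit = stats.linregress(trait, product)
    return fit.intercept, fit.slope, fit.pvalue
```

Unlike a discretized analysis, this uses the trait on its original scale, so no cut-off points have to be chosen; a formal model-based test would replace the ordinary least-squares p-value used here.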

Microarrays and omics data 2

Presenter: Sen Zhao

When: Thursday, July 14, 2016      Time: 11:00 AM - 12:30 PM

Room: Salon B Carson Hall (Level 2)

Session Synopsis:

Testing for Differential Connectivity in High-Dimensional Networks

Changes in the connectivity of biological networks, such as brain connectivity networks and genetic pathways, have been found to be associated with the onset and progression of some complex diseases. Locating differentially connected nodes or edges in the networks of healthy and diseased individuals could help researchers delineate underlying disease mechanisms and design more effective treatments. In the past decade, various methods have been proposed to infer network structures from partial correlations. Such methods, however, do not quantify uncertainty, and have limited value when drawing scientific conclusions. Recently, some tests have been proposed for examining the equality of two (inverse) covariance matrices. However, changes in the magnitudes of (inverse) correlations do not necessarily reflect changes in biological network structures. In this paper, we propose instead a framework for identifying qualitative changes in network structures, i.e., whether the edge sets of two networks are the same. We show, using theoretical and numerical analyses, that our method controls the type-I error and is powerful in identifying qualitative differences in network structure. We demonstrate the applicability of our method by detecting changes in brain networks of trauma patients and healthy controls.

Microarrays and omics data 2

Presenter: Said el Bouhaddani

When: Thursday, July 14, 2016      Time: 11:00 AM - 12:30 PM

Room: Salon B Carson Hall (Level 2)

Session Synopsis:

Latent variable Meta-analysis with Probabilistic Partial Least Squares (PPLS)

In recent years, the field of Glycomics has been expanding, creating new opportunities to identify potential biomarkers for diseases and traits such as BMI. We have access to measurements on IgG1 and IgG2 Glycomics subclasses (20 variables each) and BMI (univariate) from the Leiden Longevity Study (Netherlands, N=499), Korcula (Croatia, N=951) and Vis (Croatia, N=794) cohorts. Getting biologically valid results when associating BMI with IgG1 and IgG2 is challenging, since the IgG measurements are highly correlated between and within the subclasses. Moreover, measuring such novel omics data is difficult and expensive, and consequently the size of the available cohorts is limited. To overcome the problem of underpowered association analysis, results from multiple studies might be combined in a meta-analysis. However, heterogeneity across studies is typically present and should be accounted for. Partial least squares (PLS) can be used to handle correlation between and within two datasets. It extracts a low-dimensional latent space that explains the relation between the two datasets to a large extent. The low-dimensional space consists of a set of latent variables which are linear combinations of the original IgG variables. These latent variables are typically biologically interpretable. However, PLS does not facilitate adjustments for heterogeneity across studies. We propose Probabilistic Partial Least Squares (PPLS), an unsupervised latent variable technique which provides a more sound mathematical and statistical framework. PPLS gives rise to a likelihood that allows population heterogeneity effects to be incorporated, and produces estimates for the underlying latent variables. These latent variables are used as covariates in a linear model with BMI as outcome. A simulation study will be conducted to investigate the performance of PPLS-based meta-analysis in terms of statistical power.
Simulated outcomes with varying noise variance will be considered for association with real and simulated IgG variables. We consider three cohorts with varying levels of heterogeneity. Results of the simulation study and the analysis of the datasets will be presented.
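
The classical PLS step described above (latent variables as linear combinations of the original variables, then used as covariates for BMI) can be sketched as follows. This is the SVD-based variant of ordinary PLS without deflation, not PPLS: it has no likelihood and makes no adjustment for heterogeneity across cohorts. The function names `pls_scores` and `fit_linear_model` and the use of a single component are assumptions for illustration.

```python
import numpy as np


def pls_scores(X, Y, n_components=1):
    """Latent scores from the leading singular vectors of the cross-covariance.

    X: (n, p) matrix, e.g. IgG1 variables; Y: (n, q) matrix, e.g. IgG2 variables.
    Returns (T, U): scores for X and Y, each of shape (n, n_components).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Weight vectors maximize the covariance between X-scores and Y-scores.
    left, _, right_t = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    W = left[:, :n_components]
    C = right_t[:n_components].T
    return Xc @ W, Yc @ C


def fit_linear_model(scores, outcome):
    """Ordinary least squares of the outcome on an intercept plus latent scores."""
    design = np.column_stack([np.ones(len(outcome)), scores])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef  # coef[0] is the intercept, coef[1:] the score effects
```

Calling `pls_scores` separately per cohort and pooling the fitted effects would ignore the between-study heterogeneity that motivates the probabilistic formulation.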