Oral

Regression 1

Presenter: David Affleck

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Maximum Likelihood Estimation of Multivariate Tree Biomass Models from Incomplete Records

Tree biomass models used to estimate forest carbon stocks are typically multivariate in nature, supplying estimates of the total biomass as well as the biomass in stem, branch, foliage, and other tree components. It is common practice to develop such models so as to ensure compatibility among total and component estimates and to estimate all model parameters jointly using multi-stage least squares methods. Yet standard estimation methods have typically recognized neither the additivity of the tree biomass data themselves nor the implied stochastic constraints necessary to specify a valid underlying probability model. For model selection based on information criteria, stochastic simulation, Bayesian inference, or estimation from missing data, it is important to base estimation and inference on probabilistic models. Therefore, we show how to specify valid stochastic models for nonlinear systems of tree biomass component equations and how to estimate model parameters using maximum likelihood (ML). We also show how the model and ML procedures can be extended to accommodate unobserved or aggregated component biomass data. In particular, using a slash pine (Pinus elliottii) data set, we illustrate how observed-data covariance structures for incomplete records can be described to allow for Gaussian ML estimation of a system of nonlinear component biomass models using open-source software. Finally, we highlight the utility of this approach for assimilating biomass data sets collected under different protocols in order to develop credible models of forest biomass and carbon at regional to national levels.
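As a concrete illustration of the kind of likelihood involved, the sketch below (an illustrative assumption, not the authors' implementation) fits a three-component allometric system by Gaussian ML, where each record carries a selection/aggregation matrix describing which components, or sums of components, were actually measured. The b0 * D^b1 mean form, the diagonal error covariance and the toy data are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def component_means(D, theta):
    """Hypothetical allometric means b0 * D**b1 for stem, branch and foliage."""
    b = np.asarray(theta).reshape(3, 2)
    return np.column_stack([b[k, 0] * D ** b[k, 1] for k in range(3)])

def neg_loglik(params, D, records):
    theta, log_sd = params[:6], params[6:]
    Sigma = np.diag(np.exp(log_sd) ** 2)        # diagonal error covariance, for brevity
    mu = component_means(D, theta)
    nll = 0.0
    for i, (A, y_obs) in enumerate(records):    # A maps components to what was measured
        m, S = A @ mu[i], A @ Sigma @ A.T
        r = y_obs - m
        _, logdet = np.linalg.slogdet(2.0 * np.pi * S)
        nll += 0.5 * (logdet + r @ np.linalg.solve(S, r))
    return nll

# Toy data: two complete records and one record where only total biomass was measured.
rng = np.random.default_rng(0)
D = np.array([20.0, 30.0, 25.0])                                  # diameters
true = np.array([0.05, 2.4, 0.02, 2.1, 0.01, 1.9])                # (b0, b1) per component
y = component_means(D, true) + rng.normal(scale=[2.0, 1.0, 0.5], size=(3, 3))
records = [(np.eye(3), y[0]),                                     # all three components
           (np.eye(3), y[1]),
           (np.ones((1, 3)), y[2].sum(keepdims=True))]            # aggregated: total only

start = np.concatenate([true * 0.8, np.log([2.0, 1.0, 0.5])])
fit = minimize(neg_loglik, start, args=(D, records), method="Nelder-Mead")
print("estimated (b0, b1) pairs:", fit.x[:6].round(3))
```

Because the total-only record enters through the aggregation matrix A = (1, 1, 1), additivity of total and component biomass is respected by construction.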

Regression 1

Presenter: Mark Brewer

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Model Selection and the Cult of AIC

Model selection is difficult, even in the apparently straightforward case of choosing between linear regression models. There has been a lively debate in the statistical ecology literature in recent years, where some authors have sought to evangelise Akaike's Information Criterion (AIC) in this context while others have disagreed strongly. A series of discussion articles in the journal Ecology in 2014 dealt with part of the issue: the distinction between AIC and p-values. But within the family of information criteria, is AIC always the best choice? Theory suggests that AIC is optimal for prediction, in the sense that it minimises out-of-sample root mean square error (RMSE). Earlier simulation studies have largely borne out this theory. However, we argue that since these studies have almost always ignored between-sample heterogeneity, the benefits of using AIC have been overstated. We tackle this issue via a novel simulation framework. In ecology, different data sets are generally obtained at different points in time or space; a realistic simulation scheme should not keep a generating model fixed, but allow covariate distributions and even effects themselves to vary, the latter in a random-effects sense. We study a range of different sample sizes, relative effect sizes and levels of heterogeneity between samples. We find that the relative predictive performance of model selection by different information criteria is heavily dependent on the degree of unobserved heterogeneity between data sets. When heterogeneity is small, AIC is likely to perform well, but if heterogeneity is large, the Bayesian Information Criterion (BIC) will often perform better, as it has a stronger penalty. We propose a practical solution to the problem, framing it as one of choosing an appropriate level of penalty in a given context.
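A minimal sketch of the comparison being described (a toy generating model of my own, not the authors' simulation framework): select among all subsets of three candidate predictors by AIC and by BIC, then score each choice by out-of-sample RMSE on a fresh sample whose effects have been perturbed to mimic between-sample heterogeneity.

```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def simulate(n, beta, sigma=1.0):
    X = rng.normal(size=(n, len(beta)))
    return X, X @ beta + rng.normal(scale=sigma, size=n)

beta = np.array([0.5, 0.2, 0.0])                     # third predictor is pure noise
X, y = simulate(100, beta)
# "New" data set: effects perturbed to mimic between-sample heterogeneity
X_new, y_new = simulate(100, beta + rng.normal(scale=0.2, size=3))

subsets = [list(s) for r in range(1, 4) for s in itertools.combinations(range(3), r)]
for crit in ("aic", "bic"):
    best = min(subsets, key=lambda s: getattr(sm.OLS(y, sm.add_constant(X[:, s])).fit(), crit))
    fit = sm.OLS(y, sm.add_constant(X[:, best])).fit()
    rmse = np.sqrt(np.mean((y_new - fit.predict(sm.add_constant(X_new[:, best]))) ** 2))
    print(f"{crit.upper()}: selected columns {best}, out-of-sample RMSE {rmse:.3f}")
```

Repeating this over many simulated pairs of data sets, with varying effect sizes and heterogeneity levels, is the kind of exercise the abstract refers to.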

Regression 1

Presenter: Riccardo De Bin

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Resampling-based model-building procedures and influential points in multivariable regression

In statistical practice, the analyst often faces the problem of choosing which variables should be included in the final model. Usually, variable selection procedures such as backward elimination, stepwise regression or all-subset approaches are implemented, although it is well known that they have several shortcomings. One of the most problematic is their “instability”, in the sense that small changes in the data may cause severe differences in the set of selected variables. To tackle this issue, resampling-based approaches have been introduced. They exploit resampling techniques such as bootstrapping or subsampling to mimic data perturbations. In particular, several pseudo-samples are generated and a variable selection procedure is applied to each of them. The information is usually stored in an inclusion matrix, in which each row corresponds to a pseudo-sample and each column records whether the corresponding variable was selected in that pseudo-sample. Only variables with large column means are included in the final model. Resampling-based approaches have been investigated by De Bin et al. (Biometrics, to appear) with a focus on the choice of the resampling technique. The information available in the inclusion matrix can be further used to investigate the effect of individual observations on the result of the model-building procedure. In this talk, we show that it is possible to detect influential points by comparing the rows of the inclusion matrix. In addition to the result of variable selection, each row also carries information about the corresponding pseudo-sample, in particular whether it includes a specific observation. For each observation, therefore, we have two groups of rows (observation in/observation out). Strong differences in the selected variables between the two groups are a signal of the influence of that observation on the model-building procedure. Since these investigations are based only on the inclusion matrix, the method does not require any computation beyond what is already needed to perform a resampling-based model-building procedure. Notably, we tackle the influential-points issue from a variable-selection point of view, focusing on the connection between influential points and variable selection. This connection seems to have received little attention in the literature. We illustrate our novel approach in two real-data examples.
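The sketch below is a simplified stand-in for the procedure just described (not the paper's exact algorithm): it builds an inclusion matrix from bootstrap resamples using a crude |t| > 2 selection rule in place of backward elimination, and then flags the observations whose presence or absence in the pseudo-samples shifts the inclusion frequencies the most. The data and the planted outlier are invented for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n, p, B = 80, 5, 200
X = rng.normal(size=(n, p))
y = 0.8 * X[:, 0] + 0.4 * X[:, 1] + rng.normal(size=n)
y[0] += 8.0                                      # plant one influential observation

inclusion = np.zeros((B, p), dtype=int)          # rows: pseudo-samples, columns: variables
contains = np.zeros((B, n), dtype=bool)          # which observations each pseudo-sample uses
for b in range(B):
    idx = rng.integers(0, n, size=n)             # bootstrap pseudo-sample
    contains[b, np.unique(idx)] = True
    fit = sm.OLS(y[idx], sm.add_constant(X[idx])).fit()
    inclusion[b] = (np.abs(fit.tvalues[1:]) > 2).astype(int)   # crude selection rule

# For each observation: largest change in inclusion frequency between the
# pseudo-samples that contain it and those that do not.
shift = np.array([
    np.abs(inclusion[contains[:, i]].mean(0) - inclusion[~contains[:, i]].mean(0)).max()
    for i in range(n)])
print("most influential observations:", np.argsort(shift)[::-1][:5])
```

As in the abstract, everything after the loop reuses the inclusion matrix already produced by the resampling-based model-building step; no additional model fits are needed.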

Regression 1

Presenter: Thomas Louis

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Estimating a Population Attribute When Sample Size Is Informative

When analyzing hierarchical data in which the target of inference is the average of a cluster-specific mean parameter, averaging the cluster-specific estimates (MLEs) is unbiased, while the MLE that assumes all cluster-specific parameters are equal has minimum variance. However, if the cluster-specific parameter and sample size are correlated, this MLE will be biased, and a compromise estimate or a regression adjustment can produce improved Mean Squared Error (MSE). For example, when estimating the mean menstrual cycle length for a random sample of couples participating in a prospective pregnancy study, each woman contributes cycle lengths until pregnancy or censoring after a fixed number of cycles. The less fecund couples will contribute more cycles, and if fecundity and menstrual cycle length are related, a pooled estimate using all cycles for all women will be biased. Also motivated by this example, when estimating the cycle-specific probability of pregnancy, the estimate pooled over all women and cycles will overweight the less fecund, but neither the stratified nor the pooled approach is effective, and we propose a mixture approach. We compare the performance of the compromise and regression approaches in the first context and evaluate the mixture approach in the second. Other applications include hospital comparisons, wherein large hospitals are given more weight than small ones when estimating a risk model. If size is informative and the model isn’t properly specified, risk adjustments will be biased against the smaller hospitals.
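A tiny numeric illustration of the bias just described (toy numbers of my own, not the study data): when cluster size is correlated with the cluster-specific mean, the pooled estimator is pulled toward the large clusters, while the unweighted average of cluster-specific estimates is not.

```python
import numpy as np

rng = np.random.default_rng(3)
K = 500
mu = rng.normal(loc=29.0, scale=2.0, size=K)                  # cluster-specific mean cycle length
n = np.clip(np.round(2 + (mu - 27.0)), 1, 12).astype(int)     # longer cycles -> more cycles observed
data = [rng.normal(m, 1.0, size=ni) for m, ni in zip(mu, n)]

pooled = np.concatenate(data).mean()                          # weights clusters by n_i -> biased
unweighted = np.mean([d.mean() for d in data])                # average of cluster MLEs -> unbiased
print(f"target {mu.mean():.2f}  pooled {pooled:.2f}  unweighted {unweighted:.2f}")
```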

Regression 1

Presenter: Grégory Nuel

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Breakpoint Model for Logistic Regression and Application to the Detection of G x E Interactions

Introduction: Complex diseases are known to be highly heterogeneous in nature. This heterogeneity can be due to various factors, including genetic heterogeneity (e.g., population stratification), phenotypic heterogeneity (e.g., clinical diagnosis of schizophrenia), heterogeneity of exposure to environmental factors (e.g., alcohol, drugs, pollution), and recruitment heterogeneity over time (the so-called “cohort effect”). In the context of case-control studies, detecting and accounting for this heterogeneity can help to identify high-risk subgroups in the population and provide a better understanding of the disease. Method: In this context, we introduce a breakpoint model for logistic regression in order to detect and account for any source of heterogeneity. Our model is based on constrained hidden Markov modelling, with a constrained Markov model for the hidden segmentation and a logistic regression model for the observed part. Parameter learning is performed by combining the forward-backward recursion with the expectation-maximization (EM) algorithm. The model output includes both the regression estimates in each segment and the full posterior distribution of the breakpoints. Results: We validate and illustrate the usefulness of our model on both simulated and real datasets. In particular, we show that if individuals are ordered according to some proximity space (e.g., by increasing BMI (Body Mass Index)), we can use our model to detect interactions between genes and latent exposures within a likelihood ratio test framework. This last result seems particularly promising since it provides a unique way to distinguish between confounding factors (e.g., sex for smoking) and genuine unobserved causal exposures.
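A stripped-down illustration of the breakpoint idea (my simplification with made-up data, not the constrained-HMM/EM machinery of the talk): order individuals by a proxy exposure, scan for a single breakpoint at which the genotype effect changes, and compare against the no-breakpoint logistic model with a likelihood ratio test.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(4)
n = 400
bmi = np.sort(rng.normal(25, 4, size=n))              # ordering proxy (kept sorted)
g = rng.binomial(2, 0.3, size=n)                      # genotype coded 0/1/2
beta = np.where(np.arange(n) < n // 2, 0.0, 0.9)      # genotype effect only after the breakpoint
y = rng.binomial(1, 1 / (1 + np.exp(-(-0.5 + beta * g))))

def loglik(x, y):
    return sm.Logit(y, sm.add_constant(x.reshape(-1, 1))).fit(disp=0).llf

ll0 = loglik(g, y)                                    # single-segment (no-breakpoint) model
best = max((loglik(g[:k], y[:k]) + loglik(g[k:], y[k:]), k)
           for k in range(50, n - 50, 10))
lrt = 2 * (best[0] - ll0)
# Two extra parameters in the two-segment model; the unknown breakpoint
# makes the chi-square reference distribution only approximate.
print(f"breakpoint ~ index {best[1]}, LRT {lrt:.1f}, approx p {chi2.sf(lrt, df=2):.3g}")
```

The full model in the talk replaces this single profiled breakpoint with a posterior distribution over breakpoints obtained from the constrained HMM.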

Regression 1

Presenter: Sen Zhao

When: Monday, July 11, 2016      Time: 4:00 PM - 5:30 PM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

High-Dimensional Hypothesis Testing With the Lasso

Hypothesis testing in high-dimensional linear models is difficult due to the inherent bias of high-dimensional estimators, such as the lasso. Most existing procedures either have low power or are computationally infeasible when the dimension is large. In this paper, we propose two simple and computationally attractive methods based on the lasso for hypothesis testing in high-dimensional linear models. These two proposals build on the fact that, under some standard assumptions, the set of variables selected by the lasso is almost surely fixed, and contains all of the variables that have non-zero regression coefficients. Applying these theoretical results, we restrict our attention to the set of features selected by the lasso. We then apply classical Wald and score tests to the reduced data set. Because the lasso-selected set is almost surely fixed, distribution truncation or sample splitting is not required to obtain asymptotically valid inference on the population regression coefficients; this is in contrast to the recently proposed exact post-selection inference framework. We also establish connections between our proposals and the debiased lasso tests and investigate the advantages of our methods. Finally, we perform extensive numerical studies in support of our methods.
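A minimal sketch of the two-stage recipe described above (lasso screening followed by classical Wald tests on the selected set); the conditions under which this yields valid inference are the subject of the talk, not of this snippet, and the data here are simulated purely for illustration.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(5)
n, p = 200, 500
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.8, 0.6]                                 # three true signals
y = X @ beta + rng.normal(size=n)

selected = np.flatnonzero(LassoCV(cv=5).fit(X, y).coef_)    # stage 1: lasso screening
wald = sm.OLS(y, sm.add_constant(X[:, selected])).fit()     # stage 2: classical tests
print(dict(zip(selected, np.round(wald.pvalues[1:], 4))))   # Wald p-values per selected feature
```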