Oral

Computer intensive methods, software development, and graphics

Presenter: Chris Brien

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Randomizing and checking standard and multiphase designs using the R package dae

Firstly, given a systematic design for the assignment of treatment factors to unit factors, and given that the unit factors form a poset unit structure, the randomization of the treatment factors to the unit factors can be achieved by permuting the unit (unrandomized) factors according to the nesting and crossing relationships between the unit factors. Secondly, when planning an experiment, it is insightful to check the properties of proposed systematic designs and/or randomized layouts by obtaining their skeleton ANOVA tables; mistakes in constructing the design will also be exposed. The skeleton ANOVA table gives the sources derived from the factors that are anticipated to affect the response during the experiment. When the tables are based on factor-allocation diagrams, they show the confounding between sources [Brien, C. J., B. D. Harch, R. L. Correll, and R. A. Bailey (2011) Multiphase experiments with at least one later laboratory phase. I. Orthogonal designs. Journal of Agricultural, Biological and Environmental Statistics, 16, 422-450], and the amount of confounding can be quantified using the canonical efficiency factors. Skeleton ANOVA tables can be derived for any design, including orthogonal, nonorthogonal and multiphase designs, via an eigenanalysis using the theory of [James, A. T. and G. N. Wilkinson (1971) Factorization of the residual operator and canonical decomposition of nonorthogonal factors in the analysis of variance. Biometrika, 58, 279-294]. The R package dae [Brien, C. J. (2015) dae: functions useful in the design and ANOVA of experiments. URL: http://cran.at.r-project.org/web/packages/dae/index.html (R package version 2.7-6, accessed November 5, 2015)] includes two functions that can assist here: (i) fac.layout, which generates a randomized layout from a systematic design by permuting the unrandomized (unit) factors according to the nesting and crossing relationships between factors; and (ii) projs.canon, which performs a canonical analysis of the relationships between two or more sets of projectors using the James-Wilkinson theory and summarizes the results in the form of a skeleton ANOVA table. To obtain the skeleton ANOVA table using the latter function, all that is required is two or more formulae and a data frame containing the factors. The use of these two functions will be illustrated using examples.
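
To make the workflow concrete, the sketch below randomizes a small randomized complete block design and derives its skeleton ANOVA table. It is only an illustration: the data frame, the factor names and, in particular, the argument names passed to fac.layout and projs.canon are assumptions that should be checked against the documentation of the installed version of dae.

    library(dae)
    # Systematic design: 3 blocks, each with 4 plots; treatments A-D assigned
    # systematically within blocks (all names below are illustrative).
    sys.des <- data.frame(Blocks = factor(rep(1:3, each = 4)),
                          Plots  = factor(rep(1:4, times = 3)),
                          Trts   = factor(rep(LETTERS[1:4], times = 3)))
    # (i) Randomize by permuting the unit factors, with Plots nested within Blocks
    #     (argument names are assumed and may differ between dae versions).
    rand.des <- fac.layout(unrandomized = sys.des[c("Blocks", "Plots")],
                           nested.factors = list(Plots = "Blocks"),
                           randomized = sys.des["Trts"], seed = 2016)
    # (ii) Skeleton ANOVA: canonical analysis of the unit and treatment structures,
    #      specified as two formulae plus the data frame of factors.
    skel <- projs.canon(formulae = list(units = ~ Blocks/Plots, trts = ~ Trts),
                        data = sys.des)
    summary(skel)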

Computer intensive methods, software development, and graphics

Presenter: John Newell

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Dynamic Nomograms as a Translational Tool

Translational medicine, often described as “bench to bedside”, promotes the convergence of basic and clinical research disciplines to improve the flow from laboratory research through clinical testing/evaluation to standard clinical care. In an analogous fashion, we propose the concept of translational statistics to facilitate the integration of biostatistics within clinical research and to enhance the communication of research findings in an accurate and accessible manner to diverse audiences (e.g. policy makers, patients and the media). Reporting of statistical analyses often focuses on methodological approaches for the scientific aspects of the study; translational statistics aims to make the scientific results useful in practice. This transfer of knowledge (“from desk to decision”) informs both clinicians and patients of the benefits and risks of therapies. In this presentation, examples will be given to illustrate how modern web-based computing tools allow the development of interactive, dynamic tools for communicating and exploring research findings. As statistical inferential methods become more computational, the resulting models are increasingly complex and difficult to interpret. Nomograms can be generated to calculate the required predicted probabilities for values of the explanatory variables in a model. Such static nomograms can become cumbersome as the model becomes more complex, including higher-order interactions and explanatory variables with different functional forms. Our R package DynNom allows the creation of a dynamic nomogram for generalised linear models, including time-to-event data. The resulting nomograms are interactive, allowing a user to see the effect each explanatory variable has on the response and how the response changes as modifiable risk factors are increased or decreased. The nomogram is accompanied by a comprehensive summary of the underlying model. In theory, any model appearing in a scientific publication can be accompanied by a URL directing the ‘user’ to the corresponding dynamic nomogram, so that the results of the model are directly translational and the suitability of the model can be verified through the automatically generated model summaries.
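
As a rough illustration of the intended workflow (the exact call and optional arguments may differ between DynNom versions), a fitted glm can be passed, together with its data, to the DynNom function, which launches the interactive nomogram as a shiny application in the browser:

    library(DynNom)
    # Logistic regression on a standard R data set (purely illustrative)
    fit <- glm(case ~ age + spontaneous + induced, family = binomial, data = infert)
    # Launch the dynamic nomogram for this model in the browser;
    # additional arguments (confidence level, display options) vary by version.
    DynNom(fit, data = infert)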

Computer intensive methods, software development, and graphics

Presenter: Garth Tarr

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Model selection stability with mplot

This talk introduces the mplot R package, which provides a collection of functions to aid exploratory model selection (Tarr et al., 2015). We have developed routines for modified versions of the simplified adaptive fence procedure (Jiang et al., 2009), as well as other graphical tools such as variable inclusion plots and model selection plots (Müller and Welsh, 2010; Murray et al., 2013). A browser-based graphical user interface is provided to facilitate interaction with the results. These variable selection methods rely heavily on bootstrap resampling techniques. Fast performance for standard linear models is achieved using the branch-and-bound algorithm provided by the leaps package. Reasonable performance for generalised linear models and robust models can be achieved using sensible default tuning parameters and parallel processing. In higher dimensions, where exhaustive searches are infeasible, bootstrapping regularised estimators, such as the lasso, allows us to ascertain the stability of the estimated models. The methods implemented in mplot allow us to better explore the stability of model selection criteria.

References

Jiang J., Nguyen T. and Rao J.S. (2009). A simplified adaptive fence procedure. Statistics and Probability Letters, 79(5), 625-629. DOI: 10.1016/j.spl.2008.10.014

Müller S. and Welsh A.H. (2010). On Model Selection Curves. International Statistical Review, 78, 240-256. DOI: 10.1111/j.1751-5823.2010.00108.x

Murray K., Heritier S. and Müller S. (2013). Graphical tools for model selection in generalized linear models. Statistics in Medicine, 32, 4438-4451. DOI: 10.1002/sim.5855

Tarr G., Müller S. and Welsh A.H. (2015). mplot: An R package for graphical model stability and variable selection procedures. arxiv.org/abs/1509.07583
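
A minimal sketch of the intended usage is given below; the function names (vis, af and mplot) and the artificialeg example data follow the package documentation as we understand it, but argument names and defaults should be checked against the current mplot release.

    library(mplot)
    # Full model on the package's artificial example data
    fit <- lm(y ~ ., data = artificialeg)
    # Variable inclusion and model selection plots (bootstrap-based)
    vis.fit <- vis(fit, B = 150)
    plot(vis.fit, which = "vip")
    # Simplified adaptive fence
    af.fit <- af(fit, B = 150)
    plot(af.fit)
    # Browser-based GUI combining the results
    mplot(fit, vis.fit, af.fit)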

Computer intensive methods, software development, and graphics

Presenter: Jimmy Zeng

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Bootstrapped Model-Averaged Confidence Intervals

Model-averaging is commonly used to allow for model uncertainty in parameter estimation. In the frequentist setting, a model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with the weights being based on an information criterion, such as AIC, or on bootstrapping. A Wald confidence interval based on this estimate will often perform poorly, as its sampling distribution will generally be distinctly non-normal and estimation of the standard error is problematic. We propose a new method that uses a studentized bootstrap approach. We illustrate its use with a lognormal example, and perform a simulation study to compare its coverage properties with those of existing intervals.
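
For readers unfamiliar with the setup, the short sketch below shows the standard AIC-weighted model-averaged point estimate that such intervals are built around; it uses a generic built-in data set and does not implement the studentized bootstrap interval proposed in the talk.

    # Candidate models for an illustrative regression problem
    fits <- list(m1 = lm(mpg ~ wt,             data = mtcars),
                 m2 = lm(mpg ~ wt + hp,        data = mtcars),
                 m3 = lm(mpg ~ wt + hp + qsec, data = mtcars))
    aic <- sapply(fits, AIC)
    w   <- exp(-(aic - min(aic)) / 2)   # Akaike weights (unnormalized)
    w   <- w / sum(w)
    est <- sapply(fits, function(m) coef(m)["wt"])
    sum(w * est)                        # model-averaged estimate of the 'wt' coefficient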

Computer intensive methods, software development, and graphics

Presenter: Gustavo de los Campos

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

BGData: A Suite of R Packages for Linked Memory Mapped Arrays and Computational Methods for Big Biomedical Data

Modern biomedical data sets (e.g., data from biobanks and consortia) can be very large. Storing, handling, and analyzing these data sets can be extremely challenging: (i) the data are often too large to be held in memory; (ii) data are sometimes generated and stored in different files, or even at different centers, and may not be shared easily; and (iii) computations on these Big Data sets often strain the single-threaded design of many popular scientific computing environments, such as R. Memory mapping, linked arrays, and parallel computing are strategies that can be used to confront these challenges. We have developed a suite of packages for the R environment that allows researchers to carry out analyses of potentially very large data sets in R. The suite involves four main packages: (i) BEDMatrix (https://github.com/QuantGen/BEDMatrix) defines classes and methods for memory-mapped arrays of genotypes stored in BED files; (ii) LinkedMatrix (https://github.com/QuantGen/LinkedMatrix) implements classes and methods for row-linked and column-linked matrices that form a single array from multiple memory-mapped arrays; (iii) symDMatrix (https://github.com/QuantGen/symDMatrix) defines classes and methods for linked memory-mapped symmetric matrices that are particularly well suited to very large similarity (e.g., kinship) or distance matrices; and (iv) the BGData package (https://github.com/QuantGen/BGData) acts as an umbrella for the other packages and defines advanced parallelized computational methods for linked memory-mapped arrays that can carry out computations without fully loading the array into memory. BGData also offers a few tools for genomic data analyses. Multiple dispatch enables us to offer methods for indexing and replacement, access and modification of attributes, and computation with the same interface used by regular matrices. Consequently, users can use BEDMatrix, LinkedMatrix and symDMatrix objects as if they were regular in-memory matrices. Our benchmark demonstrates that the BGData package can handle very large data sets (with tens of thousands of rows and hundreds of thousands of columns) and that the computational methods implemented outperform the same methods available in the base package by a factor of 3-6, depending on the method and the number of cores available.
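
The sketch below illustrates the intended pattern of use; the file names are placeholders, and the calls should be checked against the packages' documentation.

    library(BEDMatrix)
    library(LinkedMatrix)
    # Memory-map genotypes stored in PLINK .bed files (paths are placeholders)
    X1 <- BEDMatrix("geno_chr1.bed")
    X2 <- BEDMatrix("geno_chr2.bed")
    # Link the two memory-mapped arrays column-wise into one virtual matrix
    X <- ColumnLinkedMatrix(X1, X2)
    # Indexing and dim() work as for an ordinary in-memory matrix
    dim(X)
    X[1:5, 1:5]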

Computer intensive methods, software development, and graphics

Presenter: Nombasa Ntushelo

When: Tuesday, July 12, 2016      Time: 9:00 AM - 10:30 AM

Room: Saanich 1-2 (Level 1)

Session Synopsis:

Multiple Factor Analysis and Principal Component Analysis with R and FactoMineR in exploratory data analysis of white wines

Multiple Factor Analysis (MFA) is recognised as an extension of Principal Component Analysis (PCA) for continuous data and of Multiple Correspondence Analysis (MCA) for categorical data. MFA is regarded as an advanced method, whereas PCA and MCA are classical methods. MFA is applied to data sets in which a set of individuals or observations is described by more than one set of variables; some of the sets of variables may be continuous and others categorical. An example of a data set to which Multiple Factor Analysis can be applied is a wine data set. Typically, in a wine data set, a set of individuals (the wines) is described by two sets of variables: chemical variables and sensory attributes. Most of the time, Principal Component Analysis is used to analyse wine data, with each set of variables (chemical data and sensory data) analysed separately. The advantage of Multiple Factor Analysis over Principal Component Analysis is that MFA produces a combined analysis of the chemical and sensory data, leading to a combined interpretation of the chemical data, the sensory data and the wines together. A comparison of Principal Component Analysis and Multiple Factor Analysis of wine data will be presented. Both MFA and PCA were performed in R using the FactoMineR package.
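
As an illustration of the two analyses, assume a hypothetical data frame white.wines whose first 8 columns are chemical measurements and whose remaining 6 columns are sensory attributes (the data frame name and column counts are purely illustrative); the FactoMineR calls are then roughly:

    library(FactoMineR)
    # PCA of the chemical variables only
    res.pca <- PCA(white.wines[, 1:8], graph = FALSE)
    # MFA treating the chemical and sensory variables as two groups of
    # standardized continuous variables ("s"); 'group' gives the number of
    # variables in each group.
    res.mfa <- MFA(white.wines, group = c(8, 6), type = c("s", "s"),
                   name.group = c("chemical", "sensory"), graph = FALSE)
    summary(res.mfa)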