Missing and Incomplete Data
Presenter: Dirk Enders
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
Extension of the pseudo likelihood method to analyze two-phase studies with selective phase 2 samples
Two-phase studies can be used to resolve the problem of unmeasured confounding by obtaining additional information from a subsample (phase 2) of all patients (phase 1). Methods to analyze two-phase case-control studies require (i) a stratification of the patients in phase 1 and (ii) that the sampling probability for phase 2 depend only on the stratification variable and the case-control status. We denote the numbers of cases and controls in phase 1 and phase 2 in stratum j by N1j, N0j, n1j and n0j, respectively. The pseudo-likelihood method by Breslow and Cain (1988) is a logistic regression based on the complete data including a stratum-specific offset, δj = log(n1j/N1j) − log(n0j/N0j), to adjust for unequal sampling probabilities within cases and controls in stratum j. We developed an extension of the pseudo-likelihood method for more complex sampling situations, i.e., when the sampling probability for phase 2 depends not on the stratification variable but on some vector Z of variables in phase 1. We assume that the phase-2 sampling probability for each value of Z occurring in the data, both for cases (w1z) and controls (w0z), is estimated from the data. The pseudo-likelihood method can then still be performed with only a slight modification in the calculation of the offset δj; more precisely, the stratum-specific phase-2 sampling probabilities n1j/N1j and n0j/N0j are replaced by the estimates of w1z and w0z, respectively. We applied the extended method in a study comparing the risk of cardiovascular events in type 2 diabetes patients treated with two antidiabetic combination therapies. We noticed that when Z is high-dimensional or contains a continuous variable, either cases or controls were missing for many values of Z. For these values of Z, either w1z or w0z could not be estimated from the data and needed to be extrapolated.
Thus, although suitable for complex sampling situations, our extended pseudo-likelihood method crucially relies on the assumption that this extrapolation is justified.
References: Breslow, N.E., and Cain, K.C. Logistic regression for two-stage case-control data. Biometrika. 1988; 75:11–20.
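A minimal numerical sketch of the offset-adjusted pseudo-likelihood fit described above. The simulated counts and data, and the Newton-Raphson fitter, are illustrative assumptions, not the authors' implementation; the key ingredient is the stratum-specific offset entering an otherwise ordinary logistic regression.

```python
import numpy as np

def fit_logistic_offset(X, y, offset, iters=25):
    """Logistic regression with a fixed per-observation offset,
    fit by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(X @ beta + offset)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1 - p))[:, None])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)

# Hypothetical stratum counts: phase-1 cases/controls (N1j, N0j)
# and phase-2 subsamples (n1j, n0j) for two strata.
N1, N0 = np.array([200, 300]), np.array([800, 700])
n1, n0 = np.array([100, 100]), np.array([100, 100])

# Stratum-specific offset from the abstract:
# delta_j = log(n1j / N1j) - log(n0j / N0j)
delta = np.log(n1 / N1) - np.log(n0 / N0)

# Simulated phase-2 data with true log-odds ratio 0.8 for x.
n = 2000
j = rng.integers(0, 2, size=n)
x = rng.normal(size=n)
eta = -1.0 + 0.8 * x + delta[j]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

X = np.column_stack([np.ones(n), x])
beta = fit_logistic_offset(X, y, delta[j])
print(beta)  # [intercept, log-odds ratio]; slope should be near 0.8
```

In the extension described in the abstract, `delta[j]` would instead be built from the estimated w1z and w0z.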
Missing and Incomplete Data
Presenter: Ying Huang
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
Evaluating and Comparing Biomarkers in Two-Phase Case-Control Studies
The two-phase sampling design, in which biomarkers are subsampled from a phase-one cohort sample representative of the target population, has become the gold standard in biomarker evaluation. Many two-phase case-control studies involve biased sampling of cases and/or controls in the second phase. For example, controls are often frequency-matched to cases with respect to other covariates. Ignoring biased sampling of cases and/or controls can lead to biased inference regarding biomarkers' classification accuracy. For the problems of estimating and comparing the area under the receiver operating characteristic curve (AUC) for a binary disease outcome, the impact of frequency matching on inference and strategies to efficiently account for biased sampling of biomarkers have not been well studied. In this project, we investigate and compare different inverse-probability weighted methods to adjust for biased sampling in estimating and comparing AUC. Asymptotic properties of these estimators and their inference procedures are developed for both Bernoulli sampling and finite-population stratified sampling. In simulation studies, the weighted estimators provide valid inference for estimation and hypothesis testing, while the standard empirical estimators can be severely biased. Based on both analytical variance formulae and simulation studies, we show that using estimated weights can lead to much improved efficiency for estimating the performance of a single marker even when biomarker sampling probabilities are known. In contrast, the improvement due to weight estimation is relatively minor for comparing the performance of paired markers, especially if the weights have been calibrated such that phase-one sample sizes equal their inverse-probability-weighted estimates based on the phase-two sample.
We demonstrate the use of the analytical variance formulae for optimizing sampling schemes in biomarker study design and illustrate the application of the proposed AUC estimators in a case-control biomarker study nested within an HIV vaccine trial.
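A sketch of the general form of an inverse-probability-weighted empirical AUC of the kind discussed in this abstract. This is a generic illustration, not the authors' estimator; in a real two-phase study the weights would be the inverse phase-two sampling probabilities (known or estimated), here they are passed in as arbitrary arrays.

```python
import numpy as np

def ipw_auc(m_case, m_ctrl, w_case, w_ctrl):
    """Inverse-probability-weighted empirical AUC: the weighted
    proportion of (case, control) pairs in which the case's marker
    value exceeds the control's, counting ties as one half."""
    diff = m_case[:, None] - m_ctrl[None, :]
    concordant = (diff > 0) + 0.5 * (diff == 0)
    w = w_case[:, None] * w_ctrl[None, :]
    return float((w * concordant).sum() / w.sum())

# With equal weights this reduces to the standard empirical AUC:
# 5 of the 6 case/control pairs are concordant here.
cases = np.array([0.9, 0.4, 0.7])
ctrls = np.array([0.2, 0.6])
print(ipw_auc(cases, ctrls, np.ones(3), np.ones(2)))  # → 5/6 ≈ 0.833
```

Under biased phase-two sampling, unequal weights reweight the pairs so that the estimate targets the population AUC rather than the (biased) sample AUC.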
Missing and Incomplete Data
Presenter: Daniel Nevo
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
The competing risks model with missing cause of failure and auxiliary case covariates
In the analysis of time-to-event data using a competing risks model, the cause of failure is often unknown for some of the cases. The probability of a missing cause is typically assumed to be independent of the cause given the time of the event and covariates measured before the event occurred. When further assuming a proportional hazards model for the cause-specific hazard, an estimating-equations approach derived by combining two separate likelihood functions (for the same data) can be used. In practice, however, the underlying missing-at-random assumption does not necessarily hold. Motivated by colorectal cancer subtype analysis, we develop a method to conduct valid analysis when additional auxiliary variables are measured for cases only. We consider a weaker missing-at-random assumption, in which the missingness pattern depends on observed quantities that include the auxiliary covariates. Overlooking these covariates can result in biased estimates. We use an informative likelihood approach that yields consistent estimates even when the underlying model for the missing cause of failure is misspecified. The superiority of our method in finite samples is demonstrated in simulation studies. We illustrate the use of our method in the analysis of Nurses' Health Study colorectal cancer data, where, apparently, the traditional missing-at-random assumption fails to hold.
Missing and Incomplete Data
Presenter: Lan Wen
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
METHODS FOR HANDLING MISSING DATA DUE TO DEATH AND DROP-OUT IN MORTAL COHORTS
Cohort data are often incomplete because some subjects drop out of the study. Maximum likelihood, inverse probability weighting (IPW), multiple imputation, and linear increments (LI) are methods commonly used to deal with such missing data. In cohort studies of ageing, missing data can arise from drop-out and death. If there is a non-negligible number of deaths, one might want to distinguish between the reasons for missingness to avoid making inferences about an immortal cohort: a cohort in which no one can die. Instead, inference based on those who are alive at any given point in time (mortal cohort inference) might be of more interest. Kurland & Heagerty (2005, Biostatistics) and Dufouil et al. (2004, Stat. Med.) introduced the IPW method to give mortal cohort inference when data might be missing due to drop-out. However, the assumptions of this method have not been described clearly. Diggle et al. (2007, JRSSC) introduced the LI method, and Seaman et al. (submitted) described the assumptions under which this method provides valid mortal cohort inference. However, the plausibility of these assumptions has yet to be examined. In this talk, I will explain the underlying assumptions of the IPW and LI methods and discuss their plausibility. Clarifying these assumptions matters in ageing studies where deaths are common, as results from these methods may be biased if some of the assumptions are not met. Through simulations I will compare the bias and efficiency of methods for making mortal cohort inference in the presence of drop-out: IPW, LI, augmented IPW, and multiple imputation. I will describe an application of these methods to the OCTO Study of cognitive ageing.
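A toy illustration of the IPW idea for mortal-cohort inference. Everything here is simulated and simplified: the target is the mean outcome among subjects alive at one follow-up wave, drop-out depends on a baseline covariate, and the drop-out probabilities are taken as known, whereas in the methods above they are estimated (which is itself one source of the assumptions under discussion).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical cohort at one follow-up wave: x is a baseline
# covariate, `alive` marks survival to the wave, y is the outcome,
# and `obs` marks the survivors who did not drop out.
x = rng.normal(size=n)
alive = rng.binomial(1, 1 / (1 + np.exp(-(1.5 - 0.5 * x)))).astype(bool)
y = 2.0 + 1.0 * x + rng.normal(size=n)

# Drop-out depends on x, so the observed survivors are a biased
# sample of all survivors; pi = P(observed | alive, x), known here.
pi = 1 / (1 + np.exp(-(0.5 + 1.0 * x)))
obs = rng.binomial(1, pi).astype(bool) & alive

# Mortal-cohort target: mean of y among those alive at the wave.
truth = y[alive].mean()
naive = y[obs].mean()                   # ignores selective drop-out
w = 1 / pi[obs]                         # inverse probability weights
ipw = np.sum(w * y[obs]) / np.sum(w)    # weighted mean over observed
print(f"naive={naive:.3f}  ipw={ipw:.3f}  truth={truth:.3f}")
```

The weighting recovers the mortal-cohort mean because, given x, drop-out is unrelated to the outcome; if that assumption fails, IPW is biased too, which is the kind of assumption the talk examines.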
Missing and Incomplete Data
Presenter: Julia GERONIMI
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
VARIABLE SELECTION FOR MULTIPLY-IMPUTED DATA WITH PENALIZED GENERALIZED ESTIMATING EQUATIONS
The generalized estimating equations (GEE) are a useful tool for marginal regression analysis with repeated measurements and longitudinal data. Penalized regressions, such as the LASSO, have been extended to GEE to allow shrinkage and dimension reduction, but they cannot handle missing data. Missing data, as well as a large number of variables combined with a small sample size, are common issues in longitudinal data. Multiple imputation is a popular tool for handling missing data, and in particular MI-GEE can be used for inference. Although methods to handle missing data such as MI-GEE have been improved, variable selection for GEE has not been systematically developed to accommodate missing data. The multiple-imputation least absolute shrinkage and selection operator (MI-LASSO) provides a consistent selection across multiply-imputed datasets but cannot handle correlation among an individual's repeated observations. We present MI-PGEE, a new multiple-imputation penalized generalized estimating equations approach that extends MI-LASSO to longitudinal data. MI-PGEE applies the group LASSO penalty to the group of estimated regression coefficients of the same variable across the multiply-imputed datasets. Estimates are computed using a local quadratic approximation and an algorithm based on a modified Newton-Raphson method; a new BIC-like criterion is presented for selecting the tuning parameter. MI-PGEE yields a consistent variable selection across multiply-imputed datasets, making it a selection method for longitudinal data that can manage missing data. We simulated different patterns of correlation structure between continuous as well as binary variables; our results demonstrate the advantages of MI-PGEE over single-imputation PGEE, such as mean imputation. The usefulness of the new method is illustrated by an application on osteoarthritis of the knee to identify important biomarkers and magnetic resonance imaging criteria that are associated with joint space width.
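The group-LASSO penalty that ties a variable's coefficients together across imputed datasets can be sketched directly. This is a minimal illustration of the penalty term only, not the authors' MI-PGEE implementation, which also involves the GEE working correlation, the local quadratic approximation, and the tuning-parameter criterion.

```python
import numpy as np

def mi_group_lasso_penalty(betas, lam):
    """Group-LASSO penalty of the MI-LASSO/MI-PGEE type:
    betas has shape (M, P), where betas[m, p] is the coefficient of
    variable p estimated in imputed dataset m. All M copies of a
    variable form one group, so a variable is either dropped from
    every imputed dataset at once or kept in all of them."""
    return lam * np.sqrt((betas ** 2).sum(axis=0)).sum()

# Two variables, three imputed datasets: variable 0 is shrunk to
# zero in every dataset, so only variable 1 contributes:
# sqrt(3^2 + 4^2 + 0^2) = 5.
betas = np.array([[0.0, 3.0],
                  [0.0, 4.0],
                  [0.0, 0.0]])
print(mi_group_lasso_penalty(betas, lam=1.0))  # → 5.0
```

Grouping per variable (rather than penalizing each imputed dataset's coefficient separately) is what makes the selection consistent across the imputed datasets.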
Missing and Incomplete Data
Presenter: Nicole Erler
When: Thursday, July 14, 2016 Time: 4:30 PM - 6:00 PM
Room: Salon C Carson Hall (Level 2)
Session Synopsis:
Bayesian imputation of (non-)linear endo- vs. exogenous time-varying covariates in linear mixed models
Missing values are a common challenge in the analysis of most observational studies. A standard solution is multiple imputation, most often using the full conditional specification. However, in more complex settings, such as a longitudinal outcome with incomplete time-varying and baseline covariates, multiple imputation using the full conditional specification can be sub-optimal because the outcome cannot easily and adequately be included in the imputation models. With time-varying covariates, two issues must be considered: the endo- or exogeneity of the covariate, and its functional relationship with the outcome. The first issue concerns whether or not the covariate values are affected by the outcome; the second concerns which features of the covariate are related to the outcome, e.g. the slope of the trajectory instead of its value, or a cumulative effect up to a certain time. In this work, we study and extend two approaches to handle such incomplete data in the Bayesian framework. The first approach approximates the joint distribution of the outcome and the covariate by assuming (latent) normal distributions for each variable and draws multiple imputations from the resulting multivariate normal distribution (e.g. Carpenter & Kenward, 2013). The second approach factorizes the joint likelihood into a sequence of conditional distributions (Ibrahim et al., 2002). These approaches imply different assumptions about the endo- or exogeneity of longitudinal covariates. In approach 1, the imputation models of the outcome and the longitudinal covariate are connected by specifying a joint distribution for their random effects and/or error terms, which implies an endogenous time-varying covariate. In approach 2, the longitudinal covariate is included in the linear predictor of the outcome, but its imputation model can be specified independently of the outcome, which implies exogeneity.
Our work adds to the literature by extending approach 2 to settings with time-varying covariates with different functional forms, by adapting it for endogenous covariates through joint modelling of the random effects of the outcome and the covariate, and by investigating the robustness of the two approaches to misspecification of that functional form as well as of the endo- or exogenous nature of the time-varying covariates.