CTSI and The Department of Biostatistics Special Invited Seminar
HPNP G 114, November 15, 2017
Tony Barr, M.S.
A Model of Reality Inc.
Title: The Search for Understanding: A Perspective from SAS Creator
My childhood dream of being an inventor led me to major in physics at North Carolina State University. However, I became fascinated with computers and took an assistantship programming with the Statistics Department. This led me to create the SAS language to bring statistics to the people. The theme of understanding gradually became a primary motivator for me in trying to write programs in many different domains. The history of SAS and the future of computing will be covered. A picture of the future is presented where people understand programming and programming becomes a universal skill.
November 9, 2017-11:00-12:00 CTRB 2161
Michael G. Hudgens
Professor, Department of Biostatistics, University of North Carolina, Chapel Hill
Director, Biostatistics Core, Center for AIDS Research, UNC
Title: Causal Inference in the Presence of Interference
A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals. However, in many settings, this assumption obviously does not hold. For example, in infectious diseases, whether one person becomes infected depends on who else in the population is vaccinated. In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference. Inference about different direct and indirect (or spillover) effects will be considered in a population where individuals form groups such that interference is possible between individuals within the same group but not between individuals in different groups. Analyses of a cholera vaccine study in over 100,000 individuals in Matlab, Bangladesh will be presented which indicate a significant indirect effect of vaccination.
January 8, 2016-11:00-12:00 CTRB 2161
Professor of Biostatistics (in Psychiatry)
Division of Biostatistics, Department of Psychiatry, Columbia University
As personalized medicine/precision medicine is emerging as a promising way to improve clinical decision-making, to customize clinical decisions for individual patients to accommodate the unique needs and preferences for each specific patient, there is a growing need for biostatistical methods to be developed and deployed to serve the needs for personalized medicine. As an example, the on-going NIH-funded PREEMPT Study, http://www.ucdmc.ucdavis.edu/chpr/preempt/, is developing a smartphone app that allows chronic pain patients and clinicians to run personalized experiments (n-of-1 trials), comparing two different pain treatments, to help patients and their clinicians to choose the most appropriate pain treatment for each individual patient. Such personalized biostatistical toolkits can be utilized by frontline clinicians and their patients to address the specific clinical questions confronted by each specific patient, to enable the specification and execution of the personalized trial protocol, to facilitate the collection of outcome and process data, to analyze and interpret the data acquired, and to produce reports to the end users to help them with evidence-based decision making. This paradigm exemplifies the potential for “Small Data” (as opposed to “Big Data”) to be deployed in clinical applications for the benefits of both today’s patients (quality improvement) and future patients (human subjects research).
- Joint work with Richard Kravitz, Christopher Schmid, and the PREEMPT Consortium.
January 22, 2016-11:00-12:00 CTRB 3161/3162
Preeminent Professor, Department of Biostatistics, University of Florida
Title: Multi-Sample Adjusted U-Statistics that Account for Confounding Covariates
Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric (e.g., textual) data. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually-reweighted data (using the propensity score for prospective data or the stratification score for retrospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed form expression of their asymptotic variance is presented. The utility of our procedures is demonstrated through simulation studies as well as an analysis of genetic data.
January 29, 2016 – 11:00-12:00 CTRB 3161/3162
Raymond J. Carroll
Distinguished Professor and Jill and Stuart A. Harlin 83 Chair in Statistics
Department of Statistics
Texas A&M University
Title: Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources
Information from various public and private data sources of extremely large sample sizes is now increasingly available for research purposes. Statistical methods are needed for utilizing information from such big data sources while analyzing data from individual studies that may collect more detailed information required for addressing specific hypotheses of interest. We consider the problem of building regression models based on individual-level data from an “internal” study while utilizing summary-level information, such as information on parameters for reduced models, from an “external” big-data source. We identify a set of very general constraints that link internal and external models. These constraints are used to develop a framework for semiparametric maximum likelihood inference that allows the distribution of the covariates to be estimated using either the internal sample or an external reference sample. We develop extensions for handling complex stratified sampling designs, such as case-control sampling, for the internal study. Asymptotic theory and variance estimators are developed for each case. We use simulation studies and a real data application to assess the performance of the proposed methods.
April 1, 2016-11:00-12:00 CTRB 3161
Department of Biostatistics
Johns Hopkins University
Title: Wearable computing and clinical brain imaging
The talk will focus on wearable computing (accelerometers, heart monitors) and structural brain imaging and will introduce three distinct scientific areas. The first part of the talk will focus on movement recognition, identification of circadian patterns of activity and quantification of their association with health outcomes. The second part will focus on structural MRI (sMRI) and its application to longitudinal lesion segmentation, tracking, and quantification in Multiple Sclerosis patients. The third part will be dedicated to the association of stroke location and stroke severity using computed tomography (CT) brain imaging in a large clinical trial. Emphasis will be on the scientific problems and datasets.
General information about Dr. Crainiceanu’s work can be found at www.smart-stats.org, while the most relevant papers for this presentation can be found at
May 6, 2016-11:00-12:00 CTRB 2161/2162
Associate Chair for Research and Healthcare Informatics
Professor of Biostatistics, SPHHP & SMBS
Assistant Director, Institute for Health Care Informatics
Department of Biostatistics
The State University of New York at Buffalo
A Semi-parametric Method for Clustering Mixed Data
Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data suffer from at least one of two central challenges: (1) they are unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions; or (2) they are unable to properly handle data sets in which only a subset of variables is related to the underlying cluster structure of interest. We first develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses (1) and in many situations (2), without requiring strong parametric assumptions. We next develop MEDEA (Multivariate Eigenvalue Decomposition Error Adjustment), a weighting system that addresses (2) even in the face of a large number of uninformative variables. We study theoretical aspects of our methods and demonstrate their superiority in a series of Monte Carlo simulation studies and a set of real-world applications.
Joint work with A. Foss, B. Ray and A. Heching
May 13, 2016-11:00-12:00 CTRB 2161/2162
Department of Statistics
University of Georgia
September 23, 2016-11:00-12:00 CTRB 3161/3162
J. Sunil Rao, Ph.D.
Director and Professor
Division of Biostatistics
Department of Public Health Sciences
University of Miami
Title: Classified Mixed Model Prediction
Abstract: Many practical problems are related to prediction, where the main interest is at subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that the method of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to study the properties of CMMP and to compare it with existing methods. In particular, we show that, even if the actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Two real-data examples are considered – one coming from a genomic study of breast cancer and the other from a sociology study of school-aged children.
- This is joint work with Jiming Jiang of UC-Davis, Jie Fan of the University of Miami and Thuan Nguyen of Oregon Health & Science University.
September 30, 2016-11:00-12:00 CTRB 3161/3162
Professor of Biostatistics and Statistics
Chair, Graduate Program in Biostatistics
University of Pennsylvania
Title: Integrative Analysis for Incorporating the Microbiome to Improve Precision Medicine
Abstract: The gut microbiome impacts health and risk of disease by dynamically integrating signals from the host and its environment. High throughput sequencing technologies enable individualized characterization of the microbiome composition and function. The resulting data can potentially be used for personalized diagnostic assessment, risk stratification, disease prevention and treatment. In this talk, I will present several ongoing microbiome studies at the University of Pennsylvania and provide some empirical evidence for using the microbiome in precision medicine. I will talk about some statistical issues related to species abundance quantification, compositional data regression and mediation analysis.
October 13, 2016-11:00-12:00 CTRB 3161/3162
Mosuk Chow, PhD
Senior Scientist & Professor of Statistics
Program Director of Master of Applied Statistics
Department of Statistics
Penn State University
Title: The Development of Master of Applied Statistics Program at Penn State
Abstract: The Master of Applied Statistics (M.A.S.) program at Penn State was created to meet the strong workforce demand for individuals with sophisticated tools and knowledge to handle and analyze data in the new information age. The development of the M.A.S. program was partially funded by the Alfred P. Sloan Foundation as a Professional Science Masters program. The program was approved by the Penn State Graduate School in 2001 for both the residence program at University Park and for the online program at World Campus. The residence program started admitting students in 2001, but the first online M.A.S. cohort began in Spring 2010. In this talk, our experience developing and offering the M.A.S. program at Penn State will be shared.
October 28, 2016-11:00-12:00 CTRB 2161/2162
Chuanhai Liu, PhD
Professor of Statistics
Title: A Multithreaded and Distributed R for Big Data Analysis
Abstract: The computer software R is one of the most popular computing tools for data analysis. In the past decade or so, tremendous efforts have been made to make R useful for big data analysis. These include Tessera, Revolution-R, and SparkR, to name a few. As we know, they all make use of Java-based software such as Hadoop and Spark. In this talk, we introduce an entirely new alternative, a multithreaded and distributed R, called SupR. The prototype of SupR (http://www.stat.purdue.edu/~chuanhai/SupR/index.html) was made possible by modifying the existing internal system implementation of R (R-3.1.1) with an additional ~40K lines of new source code in C. The key features of the prototype include (1) an R-style front-end obtained by maintaining the existing R syntax and internal basic data structures, (2) a Java-like multithreading model, (3) a Spark-like cluster computing environment, and (4) a built-in simple distributed file system. With simple examples, including multithreaded Expectation-Maximization and distributed Linear Regression, we show how SupR can be potentially useful for big data analysis.
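As a rough illustration of what "distributed linear regression" can mean, the sketch below fits a one-predictor least-squares line by computing sufficient statistics on separate chunks of data in parallel and then combining them. It is written in Python with a thread pool and is not SupR code or its API; the function names `chunk_stats` and `fit_line`, the chunk size, and the toy data are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_stats(chunk):
    """Return (n, sum x, sum y, sum x^2, sum x*y) for one chunk of (x, y) pairs."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return n, sx, sy, sxx, sxy


def fit_line(chunks, max_workers=4):
    """Fit y = intercept + slope * x from per-chunk sufficient statistics."""
    # Each chunk is summarized independently; only five numbers per chunk
    # need to be sent back and added together.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        parts = list(pool.map(chunk_stats, chunks))
    n, sx, sy, sxx, sxy = (sum(column) for column in zip(*parts))
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1, split into chunks of 3.
data = [(x, 2 * x + 1) for x in range(10)]
chunks = [data[i:i + 3] for i in range(0, len(data), 3)]
slope, intercept = fit_line(chunks)
print(slope, intercept)  # prints: 2.0 1.0
```

Because the least-squares solution depends on the data only through these sums, the combined fit is identical to the fit on the undivided data, which is why the computation splits cleanly across threads or machines.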
November 17, 2016-11:00-12:00 CTRB 2161/2162
Shiva Gautam, PhD
Professor, College of Medicine
University of Florida, Jacksonville
Title: A-kappa: An Index for Agreement Among Multiple Raters
Abstract: Medical data from biomedical studies are often imbalanced with a majority of observations coming from healthy or normal subjects. In the presence of such imbalances, agreement among multiple raters based on Fleiss’ Kappa (FK) produces counterintuitive results. Simulations suggest that the degree of FK’s misrepresentation of the observed agreement may be directly related to the degree of imbalance in the data. We propose a new method, A-Kappa (AK), for evaluating agreement among multiple raters that is not affected by such imbalances. Performance of AK and FK is compared by simulating various degrees of imbalance and the use of the proposed method is illustrated by a real data set. The proposed index of agreement may provide some insight by relating its magnitude to a probability scale whereas existing indices are interpreted arbitrarily. Computing both AK and FK may shed further light on the data and be useful in interpreting and presenting the results.
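The counterintuitive behavior of Fleiss' Kappa under imbalance can be seen in a small worked example. The sketch below implements the standard Fleiss' Kappa formula only (A-Kappa is not defined in the abstract and is not attempted here); the two toy rating tables, each with 10 subjects and 5 raters, are invented for illustration.

```python
def fleiss_kappa(counts):
    """Return (observed agreement, Fleiss' kappa).

    counts[i][j] is the number of raters who put subject i in category j;
    every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Per-subject agreement: proportion of rater pairs that agree.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(per_subject) / n_subjects
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_cat)
    return p_bar, (p_bar - p_e) / (1 - p_e)


# Hypothetical data: columns are (normal, abnormal).
imbalanced = [[5, 0]] * 8 + [[4, 1]] * 2              # almost everyone normal
balanced = [[5, 0]] * 4 + [[0, 5]] * 4 + [[4, 1]] * 2  # same pairwise agreement

print(fleiss_kappa(imbalanced))  # agreement about 0.92, kappa about -0.04
print(fleiss_kappa(balanced))    # agreement about 0.92, kappa about 0.84
```

Both tables have the same observed agreement of about 0.92, yet the imbalanced one yields a kappa slightly below zero because the chance-agreement term is already about 0.92 when nearly all ratings fall in one category.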
December 2, 2016-11:00-12:00 CTRB 3161/3162
Ejaz S. Ahmed
Professor and Dean, Faculty of Mathematics and Science
Department of Mathematics & Statistics
A Journey Through Data to BIG DATA: A Statistician's Perspective
Abstract: In this talk, I will shed light on some historical developments in the arena of so-called big data analysis, and on the use and abuse of statistical techniques when analyzing such data. Specifically, I will consider a high-dimensional setting where the number of variables is greater than the sample size. In recent literature many penalized regularization strategies are investigated for simultaneous variable selection and post-estimation. Penalty estimation strategy yields good results when the model at hand is assumed to be sparse. However, in a real scenario a model may include both sparse signals and weak signals. In this setting variable selection methods may not distinguish predictors with weak signals and sparse signals and will treat weak signals as sparse signals. The prediction based on a selected submodel may not be fruitful due to selection bias in the submodel. We suggest a high-dimensional shrinkage estimation strategy to improve the prediction performance of a given submodel. The relative performance of the proposed strategy is appraised by numerical studies including application to a real dataset.
January 27, 2017-11:00-12:00 CTRB 3161/3162
Department of Biostatistics
University of Washington
Identity by descent in populations
Abstract: Individuals are identical by descent if they share genetic material due to inheritance from a recent common ancestor. “Unrelated” pairs of people often have detectable segments of identity by descent in their genomes due to very distant relationships. Identity by descent segments are useful for a wide range of genetic analyses including association testing, relationship inference and estimating demographic history. In this talk I will outline probability models for identity by descent and genetic data, and present methods for detecting segments of identity by descent from genetic data in population-based samples. I will also present the concept of effective population size, discuss how it relates to probability models for identity by descent, and present a method for estimating recent effective population size by using identity by descent. I will show results for several populations in Europe and the US.
February 3, 2017-11:00-12:00 CTRB 2161/2162
Sudipto Banerjee, Ph.D.
Professor and Chair
Department of Biostatistics
UCLA Fielding School of Public Health
High-Dimensional Bayesian Geostatistics
Abstract: With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatial-temporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatial-temporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models infeasible for large data sets. In this talk, I will present some approaches for constructing well-defined spatial-temporal stochastic processes that accrue substantial computational savings. These processes can be used as “priors” for spatial-temporal random fields. Specifically, we will discuss and distinguish between two paradigms, low-rank and sparsity, and argue in favor of the latter for achieving massively scalable inference. We construct a well-defined Nearest-Neighbor Gaussian Process (NNGP) that can be exploited as a dimension-reducing prior embedded within a rich and flexible hierarchical modeling framework to deliver exact Bayesian inference. Both these approaches lead to algorithms with floating point operations (flops) that are linear in the number of spatial locations (per iteration). We compare these methods and demonstrate their use in a number of applications and, in particular, in inferring on the spatial-temporal distribution of air pollution in continental Europe using spatial-temporal regression models in conjunction with chemistry transport models.
- This is based upon joint work with Abhirup Datta (Johns Hopkins University) and Andrew O. Finley (Michigan State University).
March 17, 2017–11:00-12:00 CTRB 2161/2162
Arnold J. Stromberg
Professor and Chair
Department of Statistics
University of Kentucky
Feasible Solutions Algorithms: Concepts and Grant Applications
Abstract: Recent work by our group revisited feasible solution algorithms (FSAs) first popularized by Doug Hawkins in the early 1990’s. We use FSAs to find interactions between explanatory variables in predictive models. The generality of the algorithm is both a blessing and a curse. NIH and other funding agencies have been issuing more RFAs for secondary data analysis, which is ideal for FSA. This has allowed our team to submit multiple grants relating to FSA, but so far, only one pilot project has been funded. Grant reviews suggest reviewers don’t understand the algorithm, which has led to expansions of the algorithm and, in turn, to publications and improved grant applications, including a recent R01 submission.
June 16, 2017-11:00-12:00 CTRB 2161/2162
Ming-Hui Chen, PhD
Professor, Department of Statistics
University of Connecticut
A Partition Weighted Kernel (PWK) Method for Estimating Marginal Likelihoods with Applications
Abstract: Evaluating the marginal likelihood is essential for model selection. Estimators based on a single Markov chain Monte Carlo sample from the posterior distribution include the harmonic mean estimator and the inflated density ratio estimator. We propose a new class of Monte Carlo estimators based on this single Markov chain Monte Carlo sample. This class can be thought of as a generalization of the harmonic mean and inflated density ratio estimators using a partition weighted kernel (likelihood times prior). We show that our estimator is consistent and has better theoretical properties than the harmonic mean and inflated density ratio estimators. Simulation studies were conducted to examine the empirical performance of the proposed estimator. We further demonstrate the desirable features of the proposed estimator with two real data sets: one is from a prostate cancer study using an ordinal probit regression model with latent variables; the other is for the power prior construction from two Eastern Cooperative Oncology Group phase III clinical trials using the cure rate survival model with similar objectives. If time permits, an extension of the PWK method for computing marginal likelihoods over variable tree topology space will be discussed.
- This is joint work with Yu-Bo Wang, Lynn Kuo, and Paul O. Lewis.
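For readers unfamiliar with the harmonic mean estimator that the abstract takes as its starting point, the toy sketch below shows the identity it relies on: the reciprocal of the marginal likelihood equals the posterior expectation of the reciprocal likelihood. This is not the PWK method. The three-point parameter grid, the binomial data (6 successes in 10 trials), and the function names are assumptions chosen so the exact answer is available for comparison, and the posterior is sampled directly rather than by Markov chain Monte Carlo.

```python
import random
from math import comb

# Hypothetical toy model: theta takes one of three values with a uniform prior.
THETAS = [0.2, 0.5, 0.8]
PRIOR = [1 / 3, 1 / 3, 1 / 3]


def likelihood(theta, successes=6, trials=10):
    """Binomial likelihood of the assumed data given theta."""
    return comb(trials, successes) * theta**successes * (1 - theta) ** (trials - successes)


def exact_marginal():
    """Marginal likelihood: the prior-weighted sum of the likelihood."""
    return sum(p * likelihood(t) for t, p in zip(THETAS, PRIOR))


def posterior():
    """Exact posterior weights over THETAS (known here only because the model is a toy)."""
    m = exact_marginal()
    return [p * likelihood(t) / m for t, p in zip(THETAS, PRIOR)]


def harmonic_mean_estimate(num_draws=20000, seed=0):
    """Harmonic mean of the likelihood over posterior draws."""
    rng = random.Random(seed)
    draws = rng.choices(THETAS, weights=posterior(), k=num_draws)
    return num_draws / sum(1 / likelihood(t) for t in draws)


print(exact_marginal())
print(harmonic_mean_estimate())
```

In this bounded toy setting the estimate settles close to the exact value. In realistic continuous models the reciprocal likelihood can have very heavy tails, which is the usual criticism of the harmonic mean estimator and the kind of weakness that alternative estimators aim to address.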