Upcoming Seminars

January 25, 2019
Rickey Carter
Department of Biostatistics
Mayo Clinic
CTRB 2161/2162, 11:00am-12:00pm

February 8, 2019
Jon Shuster
Professor Emeritus
Department of Health Outcomes
University of Florida
CTRB 2161/2162, 11:00am-12:00pm

Past Seminars

January 8, 2016-11:00-12:00 CTRB 2161
Naihua Duan
Professor of Biostatistics (in Psychiatry)
Division of Biostatistics, Department of Psychiatry, Columbia University

As personalized (precision) medicine emerges as a promising way to improve clinical decision-making by tailoring decisions to the needs and preferences of each patient, there is a growing need to develop and deploy biostatistical methods that serve it. As an example, the ongoing NIH-funded PREEMPT Study is developing a smartphone app that allows chronic pain patients and clinicians to run personalized experiments (n-of-1 trials) comparing two pain treatments, to help them choose the most appropriate treatment for each individual patient. Such personalized biostatistical toolkits can be used by frontline clinicians and their patients to address the clinical questions facing each patient, to specify and execute the personalized trial protocol, to collect outcome and process data, to analyze and interpret the data acquired, and to produce reports that help end users make evidence-based decisions. This paradigm exemplifies the potential for “Small Data” (as opposed to “Big Data”) to be deployed in clinical applications for the benefit of both today’s patients (quality improvement) and future patients (human subjects research).

  • Joint work with Richard Kravitz, Christopher Schmid, and the PREEMPT Consortium.

January 22, 2016-11:00-12:00 CTRB 3161/3162
Somnath Datta
Preeminent Professor, Department of Biostatistics, University of Florida

Title: Multi-Sample Adjusted U-Statistics that Account for Confounding Covariates
Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric (e.g., textual) data. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually-reweighted data (using the propensity score for prospective data or the stratification score for retrospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed-form expression of their asymptotic variance is presented. The utility of our procedures is demonstrated through simulation studies as well as an analysis of genetic data.
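
To make the idea of a reweighted two-sample U-statistic concrete, here is a minimal sketch in Python. It is not the speaker's estimator: the function name `weighted_mann_whitney` and the weights passed in are hypothetical, and the sketch simply assumes that each observation arrives with a positive weight (for example, one derived from an estimated propensity score). With equal weights it reduces to the ordinary Mann-Whitney estimate of P(X < Y) + 0.5 P(X = Y).

```python
def weighted_mann_whitney(x, y, wx=None, wy=None):
    """Weighted two-sample U-statistic with the Mann-Whitney kernel.

    Returns a weighted estimate of P(X < Y) + 0.5 * P(X == Y).
    wx and wy are hypothetical per-observation weights (assumed to be
    supplied by the caller); None means equal weights.
    """
    wx = [1.0] * len(x) if wx is None else wx
    wy = [1.0] * len(y) if wy is None else wy
    numerator = 0.0
    denominator = 0.0
    for xi, wi in zip(x, wx):
        for yj, wj in zip(y, wy):
            # Kernel: 1 if xi < yj, 0.5 for ties, 0 otherwise.
            kernel = 1.0 if xi < yj else (0.5 if xi == yj else 0.0)
            numerator += wi * wj * kernel
            denominator += wi * wj
    return numerator / denominator


unweighted = weighted_mann_whitney([1, 2, 3], [2, 3, 4])
reweighted = weighted_mann_whitney([1, 2, 3], [2, 3, 4], wy=[0.0, 0.0, 1.0])
```

In the second call all of the weight in the second sample sits on the value 4, so every pair satisfies x < y and the statistic equals 1.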

January 29, 2016 – 11:00-12:00 CTRB 3161/3162
Raymond J. Carroll
Distinguished Professor and Jill and Stuart A. Harlin ’83 Chair in Statistics
Department of Statistics
Texas A&M University

Title: Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources

Information from various public and private data sources of extremely large sample sizes is now increasingly available for research purposes. Statistical methods are needed for utilizing information from such big data sources while analyzing data from individual studies that may collect more detailed information required for addressing specific hypotheses of interest. We consider the problem of building regression models based on individual-level data from an “internal” study while utilizing summary-level information, such as information on parameters for reduced models, from an “external” big-data source. We identify a set of very general constraints that link internal and external models. These constraints are used to develop a framework for semiparametric maximum likelihood inference that allows the distribution of the covariates to be estimated using either the internal sample or an external reference sample. We develop extensions for handling complex stratified sampling designs, such as case-control sampling, for the internal study. Asymptotic theory and variance estimators are developed for each case. We use simulation studies and a real data application to assess the performance of the proposed methods.

April 1, 2016-11:00-12:00 CTRB 3161
Ciprian Crainiceanu
Department of Biostatistics
Johns Hopkins University

Title: Wearable computing and clinical brain imaging
The talk will focus on wearable computing (accelerometers, heart monitors) and structural brain imaging and will introduce three distinct scientific areas. The first part of the talk will focus on movement recognition, identification of circadian patterns of activity and quantification of their association with health outcomes. The second part will focus on structural MRI (sMRI) and its application to longitudinal lesion segmentation, tracking, and quantification in Multiple Sclerosis patients. The third part will be dedicated to the association of stroke location and stroke severity using computed tomography (CT) brain imaging in a large clinical trial. Emphasis will be on the scientific problems and datasets.

General information about Dr. Crainiceanu’s work can be found at, while the most relevant papers for this presentation can be found at

May 6, 2016-11:00-12:00 CTRB 2161/2162
Marianthi Markatou
Associate Chair for Research and Healthcare Informatics
Professor of Biostatistics, SPHHP & SMBS
Assistant Director, Institute for Health Care Informatics
Department of Biostatistics
The State University of New York at Buffalo

A Semi-parametric Method for Clustering Mixed Data

Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data suffer from at least one of two central challenges: (1) they are unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions; or (2) they are unable to properly handle data sets in which only a subset of variables are related to the underlying cluster structure of interest. We first develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses (1) and in many situations (2), without requiring strong parametric assumptions. We next develop MEDEA (Multivariate Eigenvalue Decomposition Error Adjustment), a weighting system that addresses (2) even in the face of a large number of uninformative variables. We study theoretical aspects of our methods and demonstrate their superiority in a series of Monte Carlo simulation studies and a set of real-world applications.

Joint work with A. Foss, B. Ray and A. Heching.

May 13, 2016-11:00-12:00 CTRB 2161/2162
Wenxuan Zhong
Associate Professor
Department of Statistics
University of Georgia

September 23, 2016-11:00-12:00 CTRB 3161/3162
J. Sunil Rao, Ph.D.
Director and Professor
Division of Biostatistics
Department of Public Health Sciences
University of Miami

Title: Classified Mixed Model Prediction

Abstract: Many practical problems are related to prediction, where the main interest is at the subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that the method of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to study the properties of CMMP and to compare it with existing methods. In particular, we show that, even if the actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Two real-data examples are considered – one coming from a genomic study of breast cancer and the other from a sociology study of school-aged children.

  • This is joint work with Jiming Jiang of UC-Davis, Jie Fan of the University of Miami and Thuan Nguyen of Oregon Health & Science University.

September 30, 2016-11:00-12:00 CTRB 3161/3162
Hongzhe Li
Professor of Biostatistics and Statistics
Chair, Graduate Program in Biostatistics
University of Pennsylvania

Title:  Integrative Analysis for Incorporating the Microbiome to Improve Precision Medicine

Abstract: The gut microbiome impacts health and risk of disease by dynamically integrating signals from the host and its environment. High-throughput sequencing technologies enable individualized characterization of the microbiome composition and function. The resulting data can potentially be used for personalized diagnostic assessment, risk stratification, disease prevention and treatment. In this talk, I will present several ongoing microbiome studies at the University of Pennsylvania and provide some empirical evidence for using the microbiome in precision medicine. I will talk about some statistical issues related to species abundance quantification, compositional data regression and mediation analysis.

October 13, 2016-11:00-12:00 CTRB 3161/3162
Mosuk Chow, PhD
Senior Scientist & Professor of Statistics
Program Director, Master of Applied Statistics
Department of Statistics
Penn State University

Title: The Development of the Master of Applied Statistics Program at Penn State

Abstract: The Master of Applied Statistics (M.A.S.) program at Penn State was created to meet the strong workforce demand for individuals with sophisticated tools and knowledge to handle and analyze data in the new information age. The development of the M.A.S. program was partially funded by the Alfred P. Sloan Foundation as a Professional Science Masters program. The program was approved by the Penn State Graduate School in 2001 for both the residence program at University Park and the online program at World Campus. The residence program started admitting students in 2001, but the first online M.A.S. cohort began in Spring 2010. In this talk, our experience developing and offering the M.A.S. program at Penn State will be shared.

October 28, 2016-11:00-12:00 CTRB 2161/2162
Chuanhai Liu, PhD
Professor of Statistics
Purdue University

Title:  A Multithreaded and Distributed R for Big Data Analysis

Abstract: The computer software R is one of the most popular computing tools for data analysis. In the past decade or so, tremendous efforts have been made to make R useful for big data analysis. These include Tessera, Revolution-R, and SparkR, to name a few. As we know, they all make use of Java-based software such as Hadoop and Spark. In this talk, we introduce an entirely new alternative, a multithreaded and distributed R, called SupR. The prototype of SupR was made possible by modifying the existing internal system implementation of R (R-3.1.1) with an additional ~40K lines of new source code in C. The key features of the prototype include (1) an R-style front-end obtained by maintaining the existing R syntax and internal basic data structures, (2) a Java-like multithreading model, (3) a Spark-like cluster computing environment, and (4) a built-in simple distributed file system. With simple examples, including multithreaded Expectation-Maximization and distributed Linear Regression, we show how SupR can be potentially useful for big data analysis.

November 17, 2016-11:00-12:00 CTRB 2161/2162
Shiva Gautam, PhD
Professor, College of Medicine
University of Florida, Jacksonville

Title: A-Kappa: An Index for Agreement Among Multiple Raters

Abstract: Data from biomedical studies are often imbalanced, with a majority of observations coming from healthy or normal subjects. In the presence of such imbalances, agreement among multiple raters based on Fleiss’ Kappa (FK) produces counterintuitive results. Simulations suggest that the degree of FK’s misrepresentation of the observed agreement may be directly related to the degree of imbalance in the data. We propose a new method, A-Kappa (AK), for evaluating agreement among multiple raters that is not affected by such imbalances. Performance of AK and FK is compared by simulating various degrees of imbalance, and the use of the proposed method is illustrated with a real data set. The proposed index of agreement may provide some insight by relating its magnitude to a probability scale, whereas existing indices are interpreted arbitrarily. Computing both AK and FK may shed further light on the data and be useful in interpreting and presenting the results.
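
The counterintuitive behavior of Fleiss’ Kappa under imbalance can be seen in a small worked example. The sketch below implements only the standard Fleiss’ Kappa formula; it does not implement A-Kappa, whose definition is not given in the abstract. The two rating tables are made-up illustrations with three raters and two categories (normal, abnormal), chosen so that the observed agreement is identical in both.

```python
def fleiss_kappa(counts):
    """Return (observed agreement, Fleiss' kappa).

    counts[i][j] is the number of raters who assigned subject i to
    category j; every subject must be rated by the same number of raters.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Per-subject agreement: fraction of rater pairs that agree.
    per_subject = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_subject) / n_subjects
    # Chance agreement from the pooled category proportions.
    proportions = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_chance = sum(p * p for p in proportions)
    return p_observed, (p_observed - p_chance) / (1 - p_chance)


# Hypothetical data: 3 raters, columns are (normal, abnormal).
balanced = [[3, 0]] * 4 + [[0, 3]] * 4 + [[2, 1]] * 2
imbalanced = [[3, 0]] * 8 + [[2, 1]] * 2

agree_bal, kappa_bal = fleiss_kappa(balanced)
agree_imb, kappa_imb = fleiss_kappa(imbalanced)
```

Both tables have the same observed agreement (about 0.87), yet kappa is about 0.73 for the balanced table and slightly negative (about -0.07) for the imbalanced one, because the chance-agreement term grows as one category dominates.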

December 2, 2016-11:00-12:00 CTRB 3161/3162
Ejaz S. Ahmed
Professor and Dean, Faculty of Mathematics and Science
Department of Mathematics & Statistics
Brock University


A Journey Through Data to BIG DATA: A Statistician’s Perspective

Abstract: In this talk, I will shed light on some historical developments in the arena of so-called big data analysis, and on the use and abuse of statistical techniques when analyzing such data. Specifically, I will consider a high-dimensional setting where the number of variables is greater than the sample size. In recent literature, many penalized regularization strategies have been investigated for simultaneous variable selection and post-estimation. A penalty estimation strategy yields good results when the model at hand is assumed to be sparse. However, in a real scenario a model may include both sparse signals and weak signals. In this setting, variable selection methods may not distinguish predictors with weak signals from those with sparse signals and will treat weak signals as sparse signals. The prediction based on a selected submodel may not be fruitful due to selection bias in the submodel. We suggest a high-dimensional shrinkage estimation strategy to improve the prediction performance of a given submodel. The relative performance of the proposed strategy is appraised by numerical studies, including an application to a real dataset.

January 27, 2017-11:00-12:00 CTRB 3161/3162
Sharon Browning
Associate Professor
Department of Biostatistics
University of Washington


Identity by descent in populations

Abstract:  Individuals are identical by descent if they share genetic material due to inheritance from a recent common ancestor. “Unrelated” pairs of people often have detectable segments of identity by descent in their genomes due to very distant relationships. Identity by descent segments are useful for a wide range of genetic analyses including association testing, relationship inference and estimating demographic history. In this talk I will outline probability models for identity by descent and genetic data, and present methods for detecting segments of identity by descent from genetic data in population-based samples. I will also present the concept of effective population size, discuss how it relates to probability models for identity by descent, and present a method for estimating recent effective population size by using identity by descent. I will show results for several populations in Europe and the US.

February 3, 2017-11:00-12:00 CTRB 2161/2162
Sudipto Banerjee, Ph.D.
Professor and Chair
Department of Biostatistics
UCLA Fielding School of Public Health


High-Dimensional Bayesian Geostatistics

Abstract: With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatial-temporal process models have become widely deployed statistical tools for researchers to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatial-temporal models often involves expensive matrix computations with complexity increasing in cubic order for the number of spatial locations and temporal points. This renders such models infeasible for large data sets. In this talk, I will present some approaches for constructing well-defined spatial-temporal stochastic processes that accrue substantial computational savings. These processes can be used as “priors” for spatial-temporal random fields. Specifically, we will discuss and distinguish between two paradigms, low-rank and sparsity, and argue in favor of the latter for achieving massively scalable inference. We construct a well-defined Nearest-Neighbor Gaussian Process (NNGP) that can be exploited as a dimension-reducing prior embedded within a rich and flexible hierarchical modeling framework to deliver exact Bayesian inference. Both these approaches lead to algorithms with floating point operations (flops) that are linear in the number of spatial locations (per iteration). We compare these methods and demonstrate their use in a number of applications and, in particular, in inferring on the spatial-temporal distribution of air pollution in continental Europe using spatial-temporal regression models in conjunction with chemistry transport models.

  • This is based upon joint work with Abhirup Datta (Johns Hopkins University) and Andrew O. Finley (Michigan State University).

March 17, 2017-11:00-12:00 CTRB 2161/2162
Arnold J. Stromberg
Professor and Chair
Department of Statistics
University of Kentucky


Feasible Solutions Algorithms: Concepts and Grant Applications

Abstract: Recent work by our group revisited feasible solution algorithms (FSAs), first popularized by Doug Hawkins in the early 1990s. We use FSAs to find interactions between explanatory variables in predictive models. The generality of the algorithm is both a blessing and a curse. NIH and other funding agencies have been issuing more RFAs for secondary data analysis, which is ideal for FSA. This has allowed our team to submit multiple grants relating to FSA, but so far only one pilot project has been funded. Grant reviews suggest that reviewers don’t understand the algorithm, which has led to expansions of the algorithm that in turn have led to publications and improved grant applications, including a recent R01 submission.

June 16, 2017-11:00-12:00 CTRB 2161/2162
Ming-Hui Chen, PhD
Professor, Department of Statistics
University of Connecticut


A Partition Weighted Kernel (PWK) Method for Estimating Marginal Likelihoods with Applications

Abstract: Evaluating the marginal likelihood is essential for model selection. Estimators based on a single Markov chain Monte Carlo sample from the posterior distribution include the harmonic mean estimator and the inflated density ratio estimator. We propose a new class of Monte Carlo estimators based on this single Markov chain Monte Carlo sample. This class can be thought of as a generalization of the harmonic mean and inflated density ratio estimators using a partition weighted kernel (likelihood times prior). We show that our estimator is consistent and has better theoretical properties than the harmonic mean and inflated density ratio estimators. Simulation studies were conducted to examine the empirical performance of the proposed estimator. We further demonstrate the desirable features of the proposed estimator with two real data sets: one is from a prostate cancer study using an ordinal probit regression model with latent variables; the other is for the power prior construction from two Eastern Cooperative Oncology Group phase III clinical trials with similar objectives, using the cure rate survival model. If time permits, an extension of the PWK method for computing marginal likelihoods over variable tree topology space will be discussed.

  • This is joint work with Yu-Bo Wang, Lynn Kuo, and Paul O. Lewis.

November 9, 2017-11:00-12:00 CTRB 2161

Michael G. Hudgens
Professor, Department of Biostatistics, University of North Carolina, Chapel Hill
Director, Biostatistics Core, Center for AIDS Research, UNC

Title: Causal Inference in the Presence of Interference

A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals.  However, in many settings, this assumption obviously does not hold.  For example, in infectious diseases, whether one person becomes infected depends on who else in the population is vaccinated.  In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference.  Inference about different direct and indirect (or spillover) effects will be considered in a population where individuals form groups such that interference is possible between individuals within the same group but not between individuals in different groups.  Analyses of a cholera vaccine study in over 100,000 individuals in Matlab, Bangladesh will be presented which indicate a significant indirect effect of vaccination.

CTSI and The Department of Biostatistics Special Invited Seminar
HPNP G 114, November 15, 2017

Tony Barr, M.S.
A Model of Reality Inc.

Title: The Search for Understanding: A Perspective from SAS Creator

Abstract: My childhood dream of being an inventor led me to major in physics at North Carolina State University.  However, I became fascinated with computers and took an assistantship programming with the Statistics Department.  This led me to create the SAS language to bring statistics to the people.  The theme of understanding gradually became a primary motivator for me in trying to write programs in many different domains.  The history of SAS and the future of computing will be covered. A picture of the future is presented where people understand programming and programming becomes a universal skill.

August 24, 2018-11:00am-12:00pm CTRB 2161/2162
Kasper Hansen
Assistant Professor
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health

Title: Analyzing bisulfite sequencing data from human brain regions

Abstract: Epigenetic marks in the human brain have been hypothesized to be associated with disease. Towards understanding this, we have profiled DNA methylation, chromatin accessibility, and gene expression in multiple different human brain regions in fractionated cell populations. We will discuss statistical approaches to analyzing such data, which includes local likelihood smoothing and scalable computing.

September 28, 2018-11:00am-12:00pm CTRB 3161/3162
Limin Peng
Associate Professor
Department of Biostatistics and Bioinformatics
Rollins School of Public Health
Emory University

Title: Trajectory Quantile Regression for Longitudinal Data

Abstract: Quantile regression has demonstrated promising utility in longitudinal data analysis. Existing work is primarily focused on modeling cross-sectional outcomes, while outcome trajectories often carry more substantive information in practice. In this work, we develop a trajectory quantile regression framework that is designed to robustly and flexibly investigate how latent individual trajectory features are related to observed subject characteristics. The proposed models are built under multilevel modeling with usual parametric assumptions lifted or relaxed. We derive our estimation procedure by novelly transforming the problem at hand to quantile regression with perturbed responses and adapting the bias correction technique for handling covariate measurement errors. We establish desirable asymptotic properties of the proposed estimator, including uniform consistency and weak convergence. Extensive simulation studies confirm the validity of the proposed method as well as its robustness. An application to the DURABLE trial uncovers sensible scientific findings and illustrates the practical value of our proposals.

October 5, 2018-11:00am-12:00pm CTRB 2161/2162
Vasyl Pihur
Privacy/Security Engineering Manager
Snapchat, Inc.

Title: Theory and Practice of Differential Privacy

Abstract: Our understanding of how to provide strong privacy guarantees to individuals, entities or groups was revolutionized about 10 years ago with the new mathematical definition of differential privacy. Despite being developed and pursued mostly by the computer science research community, differential privacy has fundamental underpinnings in statistics and statistical inference. After all, the goal is still to learn population-level quantities, but with additional limitations on inference around individual data points. For example, how does one efficiently learn the mean age of a group of people without the ability to learn anyone’s age individually? At the core, the goal is to enable one type of inference while making the other one as difficult as possible. In this talk, we will explore learning marginal and joint discrete distributions using Rappor and conditional expectations using DDML under the local form of differential privacy.
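
The tension between population-level and individual-level inference can be illustrated with classical randomized response, the simplest mechanism that satisfies local differential privacy for a single binary attribute. The sketch below is not Rappor or DDML; the function names, the privacy parameter value, and the simulated population are assumptions made for illustration. Each person reports the true bit with probability e^ε / (1 + e^ε) and the flipped bit otherwise, and the analyst inverts the known flipping rate to recover an unbiased estimate of the population proportion.

```python
import math
import random


def randomize_bit(bit, epsilon, rng):
    """Report the true bit with probability e^eps / (1 + e^eps), else flip it."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if rng.random() < p_truth else 1 - bit


def estimate_proportion(reports, epsilon):
    """Unbiased estimate of the true proportion of ones from noisy reports."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = (1 - p_truth) + proportion * (2 * p_truth - 1)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)


# Simulated population (hypothetical): 30% of 20,000 people have the attribute.
rng = random.Random(0)
true_bits = [1] * 6000 + [0] * 14000
epsilon = 1.0
reports = [randomize_bit(b, epsilon, rng) for b in true_bits]
estimate = estimate_proportion(reports, epsilon)
```

Any single report is wrong with probability of roughly 27% at ε = 1, so it reveals little about the individual, while the aggregate estimate stays close to the true proportion of 0.30.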

October 11, 2018-11:00am-12:00pm CTRB 3161/3162
Ying Zhang
Professor & Director of Education
Department of Biostatistics
Indiana University

Title: Semiparametric Analysis of Longitudinal Data Anchored by Interval-Censored Events

Abstract: In many longitudinal studies, outcomes are assessed on time scales anchored by certain clinical events. When the anchoring events are unobserved, the study timeline becomes undefined, and the traditional longitudinal analysis loses its temporal reference. We consider the analytical situations where the anchoring events are interval censored. We show that by expressing the regression parameter estimators as stochastic functionals of a plug-in estimate of the unknown anchoring event distribution, the standard longitudinal models can be modified and extended to accommodate the less well-defined time scale. This extension enhances the existing tools for longitudinal data analysis. Under mild regularity conditions, we show that for a broad class of models, including the frequently used generalized mixed-effects models, the functional parameter estimates are consistent and asymptotically normally distributed with an $n^{1/2}$ convergence rate. To implement the method, we developed a hybrid computational procedure combining the strengths of Fisher’s scoring method and the expectation-maximization (EM) algorithm. We conducted a simulation study to validate the asymptotic properties and to examine the finite sample performance of the proposed method. A real data analysis was used to illustrate the proposed method.

November 16, 2018-11:30am-12:30pm CTRB 2161/2162
Joseph Antonelli
Department of Statistics
University of Florida

Title: Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors

Abstract: Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that multiple pollutants interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and some exposures can be beneficial and detrimental at different ranges of exposure. To estimate the health effects of complex mixtures, we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome, and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We illustrate our approach’s ability to estimate complex functions using simulated data, and apply our method to two studies to determine which environmental pollutants adversely affect health.

November 19, 2018
Qing Lu
Faculty Candidate
Department of Epidemiology and Biostatistics
Michigan State University
CTRB 3161/3162, 11:00am-12:00pm