Upcoming Seminars


Past Seminars

January 8, 2016-11:00-12:00 CTRB 2161
Naihua Duan
Professor of Biostatistics (in Psychiatry)
Division of Biostatistics, Department of Psychiatry, Columbia University

As personalized medicine/precision medicine emerges as a promising way to improve clinical decision-making by customizing clinical decisions to accommodate the unique needs and preferences of each individual patient, there is a growing need for biostatistical methods developed and deployed to serve personalized medicine. As an example, the ongoing NIH-funded PREEMPT Study is developing a smartphone app that allows chronic pain patients and clinicians to run personalized experiments (n-of-1 trials) comparing two different pain treatments, to help each patient and his or her clinician choose the most appropriate pain treatment. Such personalized biostatistical toolkits can be used by frontline clinicians and their patients to address the specific clinical questions confronting each patient: to specify and execute the personalized trial protocol, to facilitate the collection of outcome and process data, to analyze and interpret the data acquired, and to produce reports that help end users with evidence-based decision making. This paradigm exemplifies the potential for “Small Data” (as opposed to “Big Data”) to be deployed in clinical applications for the benefit of both today’s patients (quality improvement) and future patients (human subjects research).

  • Joint work with Richard Kravitz, Christopher Schmid, and the PREEMPT Consortium.

January 22, 2016-11:00-12:00 CTRB 3161/3162
Somnath Datta
Preeminent Professor, Department of Biostatistics, University of Florida

Title: Multi-Sample Adjusted U-Statistics that Account for Confounding Covariates
Multi-sample U-statistics encompass a wide class of test statistics that allow the comparison of two or more distributions. U-statistics are especially powerful because they can be applied to both numeric and non-numeric (e.g., textual) data. However, when comparing the distribution of a variable across two or more groups, observed differences may be due to confounding covariates. For example, in a case-control study, the distribution of exposure in cases may differ from that in controls entirely because of variables that are related to both exposure and case status and are distributed differently among case and control participants. We propose to use individually-reweighted data (using the propensity score for prospective data or the stratification score for retrospective data) to construct adjusted U-statistics that can test the equality of distributions across two (or more) groups in the presence of confounding covariates. Asymptotic normality of our adjusted U-statistics is established and a closed-form expression of their asymptotic variance is presented. The utility of our procedures is demonstrated through simulation studies as well as an analysis of genetic data.
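The core reweighting idea can be sketched in a few lines. The toy below is our own illustration, not the speaker's estimator: the propensities are taken as known from the simulation design rather than estimated, and the kernel is the Mann-Whitney indicator. Each cross-group pair is weighted by the product of the two inverse propensity weights:

```python
import numpy as np

def weighted_mann_whitney(y1, w1, y2, w2):
    """Individually-reweighted two-sample U-statistic (weighted analogue of
    the Mann-Whitney statistic). Each pairwise kernel h(y1_i, y2_j) =
    1{y1_i < y2_j} is weighted by the product of the unit weights."""
    h = (y1[:, None] < y2[None, :]).astype(float)   # kernel matrix
    w = w1[:, None] * w2[None, :]                    # pairwise weights
    return (w * h).sum() / w.sum()

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)                 # confounder
p = 1 / (1 + np.exp(-x))               # true propensity of group 1
g = rng.random(n) < p                  # group membership depends on x
y = x + rng.normal(size=n)             # outcome depends on x only, so the
                                       # two conditional distributions are equal
w1 = 1 / p[g]                          # inverse propensity weights
w2 = 1 / (1 - p[~g])

u_unadj = weighted_mann_whitney(y[g], np.ones(g.sum()), y[~g], np.ones((~g).sum()))
u_adj = weighted_mann_whitney(y[g], w1, y[~g], w2)
print(f"unadjusted U: {u_unadj:.3f}  (pulled away from 1/2 by confounding)")
print(f"adjusted U:   {u_adj:.3f}  (close to the null value 1/2)")
```

Under the null of equal conditional distributions, the adjusted statistic stays near its null value of 1/2 even though the raw comparison is confounded by x.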

January 29, 2016 – 11:00-12:00 CTRB 3161/3162
Raymond J. Carroll
Distinguished Professor and Jill and Stuart A. Harlin 83 Chair in Statistics
Department of Statistics
Texas A&M University

Title: Constrained maximum likelihood estimation for model calibration using summary-level information from external big data sources

Information from various public and private data sources of extremely large sample sizes is now increasingly available for research purposes. Statistical methods are needed for utilizing information from such big data sources while analyzing data from individual studies that may collect more detailed information required for addressing specific hypotheses of interest. We consider the problem of building regression models based on individual-level data from an “internal” study while utilizing summary-level information, such as information on parameters for reduced models, from an “external” big-data source. We identify a set of very general constraints that link internal and external models. These constraints are used to develop a framework for semiparametric maximum likelihood inference that allows the distribution of the covariates to be estimated using either the internal sample or an external reference sample. We develop extensions for handling complex stratified sampling designs, such as case-control sampling, for the internal study. Asymptotic theory and variance estimators are developed for each case. We use simulation studies and a real data application to assess the performance of the proposed methods.

April 1, 2016-11:00-12:00 CTRB 3161
Ciprian Crainiceanu
Department of Biostatistics
Johns Hopkins University

Title: Wearable computing and clinical brain imaging
The talk will focus on wearable computing (accelerometers, heart monitors) and structural brain imaging and will introduce three distinct scientific areas. The first part of the talk will focus on movement recognition, identification of circadian patterns of activity and quantification of their association with health outcomes. The second part will focus on structural MRI (sMRI) and its application to longitudinal lesion segmentation, tracking, and quantification in Multiple Sclerosis patients. The third part will be dedicated to the association of stroke location and stroke severity using computed tomography (CT) brain imaging in a large clinical trial. Emphasis will be on the scientific problems and datasets.

General information about Dr. Crainiceanu’s work can be found at, while the most relevant papers for this presentation can be found at

May 6, 2016-11:00-12:00 CTRB 2161/2162
Marianthi Markatou
Associate Chair for Research and Healthcare Informatics
Professor of Biostatistics, SPHHP & SMBS
Assistant Director, Institute for Health Care Informatics
Department of Biostatistics
The State University of New York at Buffalo

A Semi-parametric Method for Clustering Mixed Data

Despite the existence of a large number of clustering algorithms, clustering remains a challenging problem. As large datasets become increasingly common in a number of different domains, it is often the case that clustering algorithms must be applied to heterogeneous sets of variables, creating an acute need for robust and scalable clustering methods for mixed continuous and categorical scale data. We show that current clustering methods for mixed-type data suffer from at least one of two central challenges: (1) they are unable to equitably balance the contribution of continuous and categorical variables without strong parametric assumptions; or (2) they are unable to properly handle data sets in which only a subset of variables are related to the underlying cluster structure of interest. We first develop KAMILA (KAy-means for MIxed LArge data), a clustering method that addresses (1) and in many situations (2), without requiring strong parametric assumptions. We next develop MEDEA (Multivariate Eigenvalue Decomposition Error Adjustment), a weighting system that addresses (2) even in the face of a large number of uninformative variables. We study theoretical aspects of these methods and demonstrate their superiority in a series of Monte Carlo simulation studies and a set of real-world applications.

Joint work with A. Foss, B. Ray and A. Hetching

May 13, 2016-11:00-12:00 CTRB 2161/2162
Wenxuan Zhong
Associate Professor
Department of Statistics
University of Georgia

September 23, 2016-11:00-12:00 CTRB 3161/3162
J. Sunil Rao, Ph.D.
Director and Professor
Division of Biostatistics
Department of Public Health Sciences
University of Miami

Title: Classified Mixed Model Prediction

Abstract: Many practical problems are related to prediction, where the main interest is at the subject (e.g., personalized medicine) or (small) sub-population (e.g., small community) level. In such cases, it is possible to make substantial gains in prediction accuracy by identifying a class that a new subject belongs to. This way, the new subject is potentially associated with a random effect corresponding to the same class in the training data, so that methods of mixed model prediction can be used to make the best prediction. We propose a new method, called classified mixed model prediction (CMMP), to achieve this goal. We develop CMMP for both prediction of mixed effects and prediction of future observations, and consider different scenarios where there may or may not be a “match” of the new subject among the training-data subjects. Theoretical and empirical studies are carried out to study the properties of CMMP and its comparison with existing methods. In particular, we show that, even if an actual match does not exist between the class of the new observations and those of the training data, CMMP still helps in improving prediction accuracy. Two real-data examples are considered – one coming from a genomic study of breast cancer and the other from a sociology study of school-aged children.

  • This is joint work with Jiming Jiang of UC-Davis, Jie Fan of the University of Miami and Thuan Nguyen of Oregon Health & Science University.

September 30, 2016-11:00-12:00 CTRB 3161/3162
Hongzhe Li
Professor of Biostatistics and Statistics
Chair, Graduate Program in Biostatistics
University of Pennsylvania

Title:  Integrative Analysis for Incorporating the Microbiome to Improve Precision Medicine

Abstract: The gut microbiome impacts health and risk of disease by dynamically integrating signals from the host and its environment. High-throughput sequencing technologies enable individualized characterization of the microbiome composition and function. The resulting data can potentially be used for personalized diagnostic assessment, risk stratification, disease prevention and treatment. In this talk, I will present several ongoing microbiome studies at the University of Pennsylvania and provide some empirical evidence of using the microbiome in precision medicine. I will talk about some statistical issues related to species abundance quantification, compositional data regression and mediation analysis.

October 13, 2016-11:00-12:00 CTRB 3161/3162
Mosuk Chow, PhD
Senior Scientist & Professor of Statistics
Program Director, Master of Applied Statistics
Department of Statistics
Penn State University

Title:  The Development of Master of Applied Statistics Program at Penn State

Abstract: The Master of Applied Statistics (M.A.S.) program at Penn State was created to meet the strong workforce demand for individuals with sophisticated tools and knowledge to handle and analyze data in the new information age. The development of the M.A.S. program was partially funded by the Alfred P. Sloan Foundation as a Professional Science Master's program. The program was approved by the Penn State Graduate School in 2001 for both the residence program at University Park and the online program at World Campus. The residence program started admitting students in 2001, but the first online M.A.S. cohort began in Spring 2010. In this talk, our experience developing and offering the M.A.S. program at Penn State will be shared.

October 28, 2016-11:00-12:00 CTRB 2161/2162
Chuanhai Liu, PhD
Professor of Statistics
Purdue University

Title:  A Multithreaded and Distributed R for Big Data Analysis

Abstract: The computer software R is one of the most popular computing tools for data analysis. In the past decade or so, tremendous efforts have been made to make R useful for big data analysis. These include Tessera, Revolution R, and SparkR, to name a few. As we know, they all make use of Java-based software such as Hadoop and Spark. In this talk, we introduce an entirely new alternative, a multithreaded and distributed R, called SupR. The SupR prototype was made possible by modifying the existing internal system implementation of R (R-3.1.1) with approximately 40K lines of new C source code. The key features of the prototype include (1) an R-style front-end obtained by maintaining the existing R syntax and internal basic data structures, (2) a Java-like multithreading model, (3) a Spark-like cluster computing environment, and (4) a built-in simple distributed file system. With simple examples, including multithreaded Expectation-Maximization and distributed linear regression, we show how SupR can be potentially useful for big data analysis.

November 17, 2016-11:00-12:00 CTRB 2161/2162
Shiva Gautam, PhD
Professor, College of Medicine
University of Florida, Jacksonville

Title: A-kappa: An Index for Agreement Among Multiple Raters

Abstract: Data from biomedical studies are often imbalanced, with a majority of observations coming from healthy or normal subjects. In the presence of such imbalance, agreement among multiple raters based on Fleiss’ Kappa (FK) produces counterintuitive results. Simulations suggest that the degree of FK’s misrepresentation of the observed agreement may be directly related to the degree of imbalance in the data. We propose a new method, A-Kappa (AK), for evaluating agreement among multiple raters that is not affected by such imbalances. The performance of AK and FK is compared by simulating various degrees of imbalance, and the use of the proposed method is illustrated with a real data set. The proposed index of agreement provides some insight by relating its magnitude to a probability scale, whereas existing indices are interpreted arbitrarily. Computing both AK and FK may further shed light on the data and be useful in interpreting and presenting the results.
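The paradox the abstract alludes to is easy to reproduce. The sketch below is an illustration we constructed (not the speakers' data): it computes Fleiss' kappa from its textbook formula on a heavily imbalanced binary rating task where raw agreement is 99.2% yet FK is near zero.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa. counts: (N subjects) x (k categories) array where
    counts[i, j] = number of raters assigning subject i to category j;
    every row sums to the number of raters n. Returns (kappa, observed
    agreement)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[0]
    n = counts[0].sum()
    p_i = (np.square(counts).sum(axis=1) - n) / (n * (n - 1))  # per-subject agreement
    p_bar = p_i.mean()                                          # observed agreement
    p_j = counts.sum(axis=0) / (N * n)                          # category prevalences
    p_e = np.square(p_j).sum()                                  # chance agreement
    return (p_bar - p_e) / (1 - p_e), p_bar

# Imbalanced data: 5 raters, binary rating, 98 "normal" subjects on which
# all raters agree, plus 2 ambiguous subjects with a 4/1 split.
counts = np.vstack([np.tile([5, 0], (98, 1)), np.tile([4, 1], (2, 1))])
kappa, p_bar = fleiss_kappa(counts)
print(f"observed agreement: {p_bar:.3f}")   # very high
print(f"Fleiss' kappa:      {kappa:.3f}")   # near zero: the kappa paradox
```

Because almost all ratings fall in one category, the chance-agreement term is nearly as large as the observed agreement, driving kappa toward zero even though raters almost always agree.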

December 2, 2016-11:00-12:00 CTRB 3161/3162
Ejaz S. Ahmed
Professor and Dean, Faculty of Mathematics and Science
Department of Mathematics & Statistics
Brock University


A Journey Through Data to BIG DATA: A Statistician's Perspective

Abstract: In this talk, I will shed light on some historical developments in the arena of so-called big data analysis, and on the use and abuse of statistical techniques when analyzing such data. Specifically, I will consider a high-dimensional setting where the number of variables is greater than the sample size. In the recent literature, many penalized regularization strategies have been investigated for simultaneous variable selection and post-estimation. Penalized estimation strategies yield good results when the model at hand is assumed to be sparse. However, in a real scenario a model may include both sparse signals and weak signals, and variable selection methods may fail to distinguish the two, treating weak signals as sparse signals. Prediction based on a selected submodel may then not be fruitful due to selection bias in the submodel. We suggest a high-dimensional shrinkage estimation strategy to improve the prediction performance of a given submodel. The relative performance of the proposed strategy is appraised by numerical studies, including an application to a real dataset.

January 27, 2017-11:00-12:00 CTRB 3161/3162
Sharon Browning
Associate Professor
Department of Biostatistics
University of Washington


Identity by descent in populations

Abstract:  Individuals are identical by descent if they share genetic material due to inheritance from a recent common ancestor. “Unrelated” pairs of people often have detectable segments of identity by descent in their genomes due to very distant relationships. Identity by descent segments are useful for a wide range of genetic analyses including association testing, relationship inference and estimating demographic history. In this talk I will outline probability models for identity by descent and genetic data, and present methods for detecting segments of identity by descent from genetic data in population-based samples. I will also present the concept of effective population size, discuss how it relates to probability models for identity by descent, and present a method for estimating recent effective population size by using identity by descent. I will show results for several populations in Europe and the US.

February 3, 2017-11:00-12:00 CTRB 2161/2162
Sudipto Banerjee, Ph.D.
Professor and Chair
Department of Biostatistics
UCLA Fielding School of Public Health


High-Dimensional Bayesian Geostatistics

Abstract: With the growing capabilities of Geographic Information Systems (GIS) and user-friendly software, statisticians today routinely encounter geographically referenced data containing observations from a large number of spatial locations and time points. Over the last decade, hierarchical spatial-temporal process models have become widely deployed statistical tools for researchers seeking to better understand the complex nature of spatial and temporal variability. However, fitting hierarchical spatial-temporal models often involves expensive matrix computations whose complexity increases in cubic order with the number of spatial locations and temporal points, rendering such models infeasible for large data sets. In this talk, I will present some approaches for constructing well-defined spatial-temporal stochastic processes that accrue substantial computational savings. These processes can be used as “priors” for spatial-temporal random fields. Specifically, we will discuss and distinguish between two paradigms, low rank and sparsity, and argue in favor of the latter for achieving massively scalable inference. We construct a well-defined Nearest-Neighbor Gaussian Process (NNGP) that can be exploited as a dimension-reducing prior embedded within a rich and flexible hierarchical modeling framework to deliver exact Bayesian inference. Both approaches lead to algorithms whose floating point operations (flops) are linear in the number of spatial locations (per iteration). We compare these methods and demonstrate their use in a number of applications, in particular in inferring the spatial-temporal distribution of air pollution in continental Europe using spatial-temporal regression models in conjunction with chemistry transport models.

  • This is based upon joint work with Abhirup Datta (Johns Hopkins University) and Andrew O. Finley (Michigan State University).

March 17, 2017-11:00-12:00 CTRB 2161/2162
Arnold J. Stromberg
Professor and Chair
Department of Statistics
University of Kentucky


Feasible Solutions Algorithms: Concepts and Grant Applications

Abstract: Recent work by our group revisited feasible solution algorithms (FSAs), first popularized by Doug Hawkins in the early 1990s. We use FSAs to find interactions between explanatory variables in predictive models. The generality of the algorithm is both a blessing and a curse. NIH and other funding agencies have been issuing more RFAs for secondary data analysis, which is ideal for FSA. This has allowed our team to submit multiple grants relating to FSA, but so far only one pilot project has been funded. Grant reviews suggest reviewers do not understand the algorithm, which has led to expansions of the algorithm, which in turn have led to publications and improved grant applications, including a recent R01 submission.

June 16, 2017-11:00-12:00 CTRB 2161/2162
Ming-Hui Chen, PhD
Professor, Department of Statistics
University of Connecticut


A Partition Weighted Kernel (PWK) Method for Estimating Marginal Likelihoods with Applications

Abstract: Evaluating the marginal likelihood is essential for model selection. Estimators based on a single Markov chain Monte Carlo sample from the posterior distribution include the harmonic mean estimator and the inflated density ratio estimator. We propose a new class of Monte Carlo estimators based on this single Markov chain Monte Carlo sample. This class can be thought of as a generalization of the harmonic mean and inflated density ratio estimators using a partition weighted kernel (likelihood times prior). We show that our estimator is consistent and has better theoretical properties than the harmonic mean and inflated density ratio estimators. Simulation studies were conducted to examine the empirical performance of the proposed estimator. We further demonstrate the desirable features of the proposed estimator with two real data sets: one is from a prostate cancer study using an ordinal probit regression model with latent variables; the other is for the power prior construction from two Eastern Cooperative Oncology Group phase III clinical trials with similar objectives, using the cure rate survival model. Time permitting, an extension of the PWK method for computing marginal likelihoods over a variable tree topology space will be discussed.

  • This is joint work with Yu-Bo Wang, Lynn Kuo, and Paul O. Lewis.
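As context for the baseline the talk improves on, the classical harmonic mean estimator can be sketched on a toy conjugate model where the exact marginal likelihood is available in closed form. The model, seed, and sample sizes below are illustrative assumptions of ours, not from the talk:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy conjugate model: y_i ~ N(theta, 1) with prior theta ~ N(0, 1).
n = 5
y = rng.normal(loc=0.5, size=n)

# Exact log marginal likelihood: y ~ N(0, I + 11'), using
# log|I + 11'| = log(n+1) and (I + 11')^{-1} = I - 11'/(n+1).
log_m_exact = (-0.5 * n * np.log(2 * np.pi) - 0.5 * np.log(n + 1)
               - 0.5 * (np.sum(y**2) - np.sum(y)**2 / (n + 1)))

# Posterior: theta | y ~ N(n*ybar/(n+1), 1/(n+1)); draw a Monte Carlo sample.
S = 200_000
theta = rng.normal(n * y.mean() / (n + 1), np.sqrt(1 / (n + 1)), size=S)

# Harmonic mean estimator: 1 / mean_s(1 / L(theta_s)), computed in logs
# with the log-sum-exp trick for numerical stability.
log_lik = -0.5 * n * np.log(2 * np.pi) - 0.5 * ((y[None, :] - theta[:, None])**2).sum(axis=1)
a = -log_lik
log_hm = -(a.max() + np.log(np.mean(np.exp(a - a.max()))))
print(f"exact log marginal:     {log_m_exact:.3f}")
print(f"harmonic mean estimate: {log_hm:.3f}")
```

The estimator is consistent here, but it is notoriously unstable (its Monte Carlo variance can be infinite), which is exactly the kind of deficiency the proposed PWK class is designed to address.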

November 9, 2017-11:00-12:00 CTRB 2161

Michael G. Hudgens
Professor, Department of Biostatistics, University of North Carolina, Chapel Hill
Director, Biostatistics Core, Center for AIDS Research, UNC

Title: Causal Inference in the Presence of Interference

A fundamental assumption usually made in causal inference is that of no interference between individuals (or units), i.e., the potential outcomes of one individual are assumed to be unaffected by the treatment assignment of other individuals.  However, in many settings, this assumption obviously does not hold.  For example, in infectious diseases, whether one person becomes infected depends on who else in the population is vaccinated.  In this talk we will discuss recent approaches to assessing treatment effects in the presence of interference.  Inference about different direct and indirect (or spillover) effects will be considered in a population where individuals form groups such that interference is possible between individuals within the same group but not between individuals in different groups.  Analyses of a cholera vaccine study in over 100,000 individuals in Matlab, Bangladesh will be presented which indicate a significant indirect effect of vaccination.

CTSI and The Department of Biostatistics Special Invited Seminar
HPNP G 114, November 15, 2017

Tony Barr, M.S.
A Model of Reality Inc.

Title: The Search for Understanding: A Perspective from the Creator of SAS

Abstract: My childhood dream of being an inventor led me to major in physics at North Carolina State University. However, I became fascinated with computers and took an assistantship programming with the Statistics Department. This led me to create the SAS language to bring statistics to the people. The theme of understanding gradually became a primary motivator for me in trying to write programs in many different domains. The history of SAS and the future of computing will be covered, and a picture of the future will be presented in which people understand programming and programming becomes a universal skill.

August 24, 2018-11:00am- 12:00pm CTRB 2161/2162
Kasper Hansen
Assistant Professor
Department of Biostatistics
Johns Hopkins Bloomberg School of Public Health

Title: Analyzing bisulfite sequencing data from human brain regions

Abstract: Epigenetic marks in the human brain have been hypothesized to be associated with disease. Towards understanding this, we have profiled DNA methylation, chromatin accessibility, and gene expression in multiple different human brain regions in fractionated cell populations. We will discuss statistical approaches to analyzing such data, which include local likelihood smoothing and scalable computing.

September 28, 2018-11:00am-12:00pm CTRB 3161/3162
Limin Peng
Associate Professor
Department of Biostatistics and Bioinformatics
Rollins School of Public Health
Emory University

Title: Trajectory Quantile Regression for Longitudinal Data

Abstract: Quantile regression has demonstrated promising utility in longitudinal data analysis. Existing work is primarily focused on modeling cross-sectional outcomes, while outcome trajectories often carry more substantive information in practice. In this work, we develop a trajectory quantile regression framework that is designed to robustly and flexibly investigate how latent individual trajectory features are related to observed subject characteristics. The proposed models are built under multilevel modeling with the usual parametric assumptions lifted or relaxed. We derive our estimation procedure by a novel transformation of the problem at hand into quantile regression with perturbed responses, adapting the bias correction technique for handling covariate measurement errors. We establish desirable asymptotic properties of the proposed estimator, including uniform consistency and weak convergence. Extensive simulation studies confirm the validity of the proposed method as well as its robustness. An application to the DURABLE trial uncovers sensible scientific findings and illustrates the practical value of our proposals.

October 5, 2018-11:00am-12:00pm CTRB 2161/2162
Vasyl Pihur
Privacy/Security Engineering Manager
Snapchat, Inc.

Title: Theory and Practice of Differential Privacy

Abstract: Our understanding of how to provide strong privacy guarantees to individuals, entities or groups was revolutionized about 10 years ago with the new mathematical definition of differential privacy. Despite being developed and pursued mostly by the computer science research community, differential privacy has fundamental underpinnings in statistics and statistical inference. After all, the goal is still to learn population-level quantities, but with additional limitations on inference about individual data points. For example, how does one efficiently learn the mean age of a group of people without the ability to learn anyone’s age individually? At the core, the goal is to enable one type of inference while making the other as difficult as possible. In this talk, we will explore learning marginal and joint discrete distributions using RAPPOR and conditional expectations using DDML under the local form of differential privacy.
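The mean-age question in the abstract has a textbook answer in the local model: each person perturbs their own value before it ever leaves their hands. The sketch below is a generic local Laplace mechanism, not the RAPPOR or DDML algorithms named in the talk, and the ages, bounds, and epsilon are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

def ldp_release(value, lower, upper, epsilon):
    """Locally differentially private release of one bounded value via the
    Laplace mechanism: clamp to [lower, upper], then add Laplace noise with
    scale (upper - lower) / epsilon (sensitivity over epsilon)."""
    clamped = min(max(value, lower), upper)
    scale = (upper - lower) / epsilon
    return clamped + rng.laplace(0.0, scale)

# Each of n people reports only a noisy version of their own age. No single
# report reveals that person's age, but the population mean is still accurate
# because the independent noise averages out at rate 1/sqrt(n).
ages = rng.integers(18, 90, size=100_000)
eps = 1.0
reports = np.array([ldp_release(a, 0, 100, eps) for a in ages])
print(f"true mean age:    {ages.mean():.2f}")
print(f"estimated mean:   {reports.mean():.2f}")
print(f"one noisy report: {reports[0]:.1f}  (true value {ages[0]})")
```

Any individual report is swamped by noise (standard deviation roughly 141 years at epsilon = 1), which is what makes inference about one person hard while the group mean remains learnable.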

October 11, 2018-11:00am-12:00pm CTRB 3161/3162
Ying Zhang
Professor & Director of Education
Department of Biostatistics
Indiana University

Title: Semiparametric Analysis of Longitudinal Data Anchored by Interval-Censored Events

Abstract: In many longitudinal studies, outcomes are assessed on time scales anchored by certain clinical events. When the anchoring events are unobserved, the study timeline becomes undefined, and the traditional longitudinal analysis loses its temporal reference. We consider the analytical situations where the anchoring events are interval censored. We show that by expressing the regression parameter estimators as stochastic functionals of a plug-in estimate of the unknown anchoring event distribution, the standard longitudinal models can be modified and extended to accommodate the less well-defined time scale. This extension enhances the existing tools for longitudinal data analysis. Under mild regularity conditions, we show that for a broad class of models, including the frequently used generalized mixed-effects models, the functional parameter estimates are consistent and asymptotically normally distributed with an n^(1/2) convergence rate. To implement the method, we developed a hybrid computational procedure combining the strengths of Fisher’s scoring method and the expectation-maximization (EM) algorithm. We conducted a simulation study to validate the asymptotic properties and to examine the finite sample performance of the proposed method. A real data analysis is used to illustrate the proposed method.

November 16, 2018-11:30am-12:30pm CTRB 2161/2162
Joseph Antonelli
Department of Statistics
University of Florida

Title: Estimating the health effects of environmental mixtures using Bayesian semiparametric regression and sparsity inducing priors

Abstract: Humans are routinely exposed to mixtures of chemical and other environmental factors, making the quantification of health effects associated with environmental mixtures a critical goal for establishing environmental policy sufficiently protective of human health. The quantification of the effects of exposure to an environmental mixture poses several statistical challenges. It is often the case that multiple exposures interact with each other to affect an outcome. Further, the exposure-response relationship between an outcome and a given exposure can be beneficial at some ranges of exposure and detrimental at others. To estimate the health effects of complex mixtures we propose a flexible Bayesian approach that allows exposures to interact with each other and have nonlinear relationships with the outcome. We induce sparsity using multivariate spike and slab priors to determine which exposures are associated with the outcome, and which exposures interact with each other. The proposed approach is interpretable, as we can use the posterior probabilities of inclusion into the model to identify pollutants that interact with each other. We illustrate our approach’s ability to estimate complex functions using simulated data, and apply our method to two studies to determine which environmental pollutants adversely affect health.

November 19, 2018-11:00am-12:00pm CTRB 3161/3162
Qing Lu
Faculty Candidate
Associate Professor, Department of Epidemiology and Biostatistics
Michigan State University

Title: A Kernel-Based Neural Network for High-dimensional Risk Prediction on Massive Genetic Data

Abstract: Artificial intelligence (AI) is a thriving research field with many successful applications in areas such as imaging and speech recognition. Neural-network-based methods, such as deep learning, play a central role in modern AI technology. While neural-network-based methods also hold great promise for genetic research, the high dimensionality of genetic data, the massive number of study samples, and complex relationships between genetic variants and disease outcomes create tremendous analytic and computational challenges. To address these challenges, we propose a kernel-based neural network (KNN) method. KNN inherits features from both linear mixed models (LMMs) and classical neural networks, and is designed for high-dimensional genetic risk prediction analysis. KNN summarizes a large number of genetic variants into kernel matrices and uses the kernel matrices as input matrices. Based on the kernel matrices, KNN builds a feedforward neural network to model the complex relationship between genetic variants and a disease outcome. Minimum norm quadratic unbiased estimation and batch training are implemented in KNN to accelerate the parameter estimation, making KNN applicable to large-scale datasets with millions of samples. Through simulations and a real data application, we demonstrate the advantages of KNN over LMMs in terms of prediction accuracy and computational efficiency.

December 3, 2018-11:00am-12:00pm CTRB 3161/3162
Tingting Zhang
Faculty Candidate
Associate Professor, Department of Statistics
University of Virginia

Title: A Bayesian Stochastic-Blockmodel-based Approach for Mapping Epileptic Brain Networks

Abstract: The human brain is a dynamic system consisting of many consistently interacting regions. The brain regions and the influences exerted by each region over another, called directional connectivity, form a directional network. We study normal and abnormal directional brain networks of epileptic patients using their intracranial EEG (iEEG) data, which are multivariate time series recordings of many small brain regions. We propose a high-dimensional state-space multivariate autoregression model (SSMAR) for iEEG data. To characterize brain networks with a commonly reported cluster structure, we use a stochastic-block-model-motivated prior for possible network patterns in the SSMAR. We develop a Bayesian framework to estimate the proposed high-dimensional model, examine the probabilities of nonzero directional connectivity among every pair of regions, identify clusters of densely-connected brain regions, and map epileptic patients’ brain networks in different seizure stages. We show through both simulation and real data analysis that the new method outperforms existing network methods by being flexible enough to characterize various high-dimensional network patterns and robust to violations of model assumptions, low iEEG sampling frequency, and data noise. Applying the developed SSMAR and Bayesian approach to an epileptic patient’s iEEG data, we reveal the patient’s network changes at the seizure onset and the unique connectivity of the seizure onset zone (SOZ), where seizures start and spread to other normal regions. Using this network result, our method has the potential to assist clinicians in localizing the SOZ, a long-standing research focus in epilepsy diagnosis and treatment.

December 20, 2018-10:00am-11:00am CTRB 2161/2162
Fan Li
Faculty Candidate
Graduate Research Assistant, Department of Biostatistics and Bioinformatics
Duke University

Title: Propensity Score Weighting for Causal Inference with Multiple Treatments

Abstract: Unconfounded comparisons of multiple groups are common in observational studies. Motivated by an observational study comparing three medications (causal comparison) and a racial disparity study in health services research (unconfounded descriptive comparison), we propose a unified framework, the balancing weights, for estimating causal effects with multiple treatments using propensity score weighting. These weights incorporate the generalized propensity score to balance the weighted covariate distribution of each treatment group, all weighted toward a common pre-specified target population. The class of balancing weights includes several existing approaches, such as inverse probability weights and trimming weights, as special cases. Within this framework, we focus on a class of target estimands based on linear contrasts and their corresponding nonparametric weighting estimators. We further develop the generalized overlap weights, constructed as the product of the inverse probability weights and the harmonic mean of the generalized propensity scores. The generalized overlap weights correspond to the target population with the most overlap in covariates between treatments, similar to the population in equipoise in clinical trials. These weights are bounded and thus bypass the problem of extreme propensities. We show that the generalized overlap weights minimize the total asymptotic variance of the nonparametric estimators for the pairwise contrasts within the class of balancing weights. We consider two balance check criteria and propose a new sandwich variance estimator for estimating the causal effects with generalized overlap weights. We apply these methods to study the causal effect of three anti-coagulants on patient mortality and to estimate racial disparities in medical expenditure. The operating characteristics of the new weighting method are further illustrated by simulations.
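The abstract gives the generalized overlap weight an explicit construction: the inverse probability weight for the received treatment times the harmonic mean of the J generalized propensity scores. A numerical sketch (with made-up Dirichlet propensity scores standing in for a fitted model, and the harmonic mean written up to its constant factor J):

```python
import numpy as np

rng = np.random.default_rng(2)
n, J = 8, 3
# Generalized propensity scores: each row is a probability vector over the
# J treatments (a Dirichlet draw stands in for a fitted multinomial model)
e = rng.dirichlet(np.ones(J), size=n)
z = rng.integers(0, J, size=n)          # observed treatment assignment

# Generalized overlap weight = IPW for the received treatment x harmonic
# mean of all J propensity scores (written here up to the constant J)
harmonic = 1.0 / (1.0 / e).sum(axis=1)
ipw = 1.0 / e[np.arange(n), z]
w = ipw * harmonic
```

Writing the weight out as (1/e_z) / (sum over k of 1/e_k) makes the boundedness claim in the abstract concrete: every weight lies strictly between 0 and 1, so an extreme propensity score can never blow a single observation up.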

December 21, 2018-10:00am-11:00am CTRB 3161/3162
Hai Shu
Faculty Candidate
Postdoctoral Fellow, Department of Biostatistics
University of Texas, MD Anderson Cancer Center

Title: Extracting Common and Distinctive Signals from High-dimensional Datasets

Abstract: Modern biomedical studies often collect large-scale multi-source/-modal datasets on a common set of objects. A typical approach to the joint analysis of such high-dimensional datasets is to decompose each data matrix into three parts: a low-rank common matrix that captures the shared information across datasets, a low-rank distinctive matrix that characterizes the individual information within the single dataset, and an additive noise matrix. Existing decomposition methods often focus on the orthogonality between the common and distinctive matrices, but inadequately consider a more necessary orthogonal relationship among the distinctive matrices. The latter guarantees that no more shared information is extractable from the distinctive matrices. We propose decomposition-based canonical correlation analysis (D-CCA), a novel decomposition method that defines the common and distinctive matrices from the L2 space of random variables rather than the conventionally used Euclidean space, with a carefully designed orthogonal relationship among the distinctive matrices. The associated estimators of common and distinctive signal matrices are asymptotically consistent and perform better than state-of-the-art methods in both simulated data and the analyses of breast cancer genomic datasets from The Cancer Genome Atlas and motor-task functional MRI data from the Human Connectome Project.

January 14, 2019-11:00am-12:00pm CTRB 3161/3162
Fei Gao
Faculty Candidate
Senior Fellow, Department of Biostatistics
University of Washington

Title: Non-iterative Estimation Update for Parametric and Semiparametric Models with Population-based Auxiliary Information

Abstract: With the advancement of disease registries and surveillance data, population-based information on disease incidence, survival probability, and other important biological characteristics becomes increasingly available. Such information can be leveraged in studies that collect detailed measurements but have smaller sample sizes. In contrast to recent proposals that formulate the additional information as constraints in optimization problems, we develop a general framework to construct simple estimators that update the usual regression estimators with functionals of the data that incorporate the additional information. We consider general settings that include nuisance parameters in the auxiliary information, non-i.i.d. data such as case-control sampling, and semiparametric models with infinite-dimensional parameters. Detailed examples of several important data and sampling settings are provided.

January 16, 2019-11:00am-12:00pm CTRB 3161/3162
Nicholas Henderson
Faculty Candidate
Post-doctoral Fellow, Department of Biostatistics and Bioinformatics
Johns Hopkins University School of Medicine

Title: Estimating heterogeneous treatment effects with censored data via fully nonparametric Bayesian accelerated failure time models

Abstract: Individuals often respond differently to identical treatments, and characterizing such variability in treatment response is an important aim in the practice of personalized medicine. In this article, we describe a nonparametric accelerated failure time model that can be used to analyze heterogeneous treatment effects (HTE) when patient outcomes are time-to-event. By utilizing Bayesian additive regression trees and a mean-constrained Dirichlet process mixture model, our approach offers a flexible model for the regression function while placing few restrictions on the baseline hazard. Our nonparametric method leads to natural estimates of individual treatment effect and has the flexibility to address many major goals of HTE assessment. Moreover, our method requires little user input in terms of model specification for treatment covariate interactions or for tuning parameter selection. Our procedure shows strong predictive performance while also exhibiting good frequentist properties in terms of parameter coverage and mitigation of spurious findings of HTE. We illustrate the merits of our proposed approach with a detailed analysis of two large clinical trials for the prevention and treatment of congestive heart failure using an angiotensin-converting enzyme inhibitor. The analysis revealed considerable evidence for the presence of HTE in both trials as demonstrated by substantial estimated variation in treatment effect and by high proportions of patients exhibiting strong evidence of having treatment effects which differ from the overall treatment effect.

January 25, 2019-11:00am-12:00pm CTRB 3161/3162
Rickey Carter
Professor, Department of Biostatistics
Mayo Clinic College of Medicine and Science

Title: Data science and biostatistics: the power of thinking differently through team science

Abstract: The development of artificial intelligence solutions to help advance the practice of health care could be considered the next space race. In this presentation, the journey from traditional biostatistics to the new world of artificial intelligence will be explored through a series of motivating examples that emphasize the role of team science and thinking inside the black box. The talk will highlight the role of convolutional neural networks and deep learning for data that may not be typically envisioned as an image. The presentation will conclude with information on new resources that are being developed to provide cross training opportunities to help raise awareness of machine learning techniques within the biostatistical community.

February 8, 2019-11:00am-12:00pm CTRB 2161/2162
Jon Shuster
Professor Emeritus, Department of Health Outcomes and Biomedical Informatics
University of Florida

Title: Meta-Analysis of Clinical Trials: Effects-at-Random or Studies-at-Random?

Abstract: Meta-analysis and systematic reviews stand at the top of most “evidence pyramids”. Virtually all random-effects meta-analyses ever done (classical or Bayes) use the “effects-at-random” premise, where the random effect size for each study is drawn from an urn and the population mean of the urn is estimated. The almost never used “studies-at-random” premise instead presumes that the observed studies are a random sample drawn from a large conceptual urn of studies. The important distinction is that under effects-at-random there can be no association between the random effect sizes and the study design parameters, which determine study weights. It is impossible to prove beyond a reasonable doubt that no such association exists. The framework for inference under studies-at-random, which estimates the mean outcome in the urn of studies using the study sample sizes as its weights, offers many advantages over effects-at-random. We cite four here. First, in the target population, the mean of each completed study is known without error, so single-stage cluster sampling methods can easily be applied. Second, studies-at-random, but not effects-at-random, recognizes that the study sample sizes are random variables, a source of variation conveniently not considered under effects-at-random. Third, studies-at-random, but not effects-at-random, can accommodate interaction between study designs and study point estimates. Lastly, the asymptotic distribution under effects-at-random, but not studies-at-random, requires either normality assumptions or large samples within studies; both approaches are asymptotic in the number of studies being combined. Of note, we shall present an eye-opening real situation for effects-at-random where keeping the point estimates as they were, but cutting the standard errors uniformly by 30%, causes a highly significant result to become non-significant. This cannot happen under studies-at-random.
We shall apply studies-at-random methods to four situations: (1) low event-rate binomial trials, (2) trials with quantitative endpoints, (3) survival analysis trials, and (4) Bland-Altman analysis with repeated measures within subjects. Unlike the classical repeated-measures Bland-Altman methods, the studies-at-random approach is completely robust to the lack of independence within subjects. Lives and millions of dollars are at stake from the nearly universally applied, inappropriate effects-at-random methods; this ought to be the biggest statistical story of the decade.
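The core contrast between the two weighting schemes fits in a few lines. The study summaries below are hypothetical numbers invented for illustration, not data from the talk:

```python
import numpy as np

# Hypothetical per-study summaries: effect estimates, standard errors, sizes
est = np.array([0.30, 0.10, 0.25, 0.05])
se  = np.array([0.10, 0.05, 0.12, 0.04])
n_i = np.array([120, 450, 90, 600])

# Conventional fixed/random-effects style: inverse-variance weights, which
# depend on the reported standard errors
w_ev = 1.0 / se**2
mean_ev = (w_ev * est).sum() / w_ev.sum()

# Studies-at-random: weight each study by its sample size only
w_sr = n_i.astype(float)
mean_sr = (w_sr * est).sum() / w_sr.sum()

# In a random-effects analysis the weights 1/(se^2 + tau^2) shift when the
# standard errors are rescaled, which is how the abstract's 30% SE cut can
# flip significance; the sample-size weights w_sr never touch the SEs.
```

Note how the two pooled means disagree even on these four studies, because the inverse-variance weights and the sample-size weights rank the studies differently.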

February 11, 2019 – 10:00am-11:00am CTRB 3161/3162
Xiang-Yang Lou
Faculty Candidate, CDCC
Professor, Department of Pediatrics
University of Arkansas for Medical Sciences

Title: GMDR: A machine learning method for identifying multifactor interactions

Abstract: Human health and diseases are determined by the interaction of a multitude of “nature and nurture” influences, including genetic, environmental, lifestyle (i.e., diet and physical activity), and other factors. Identification of multifactor gene-gene and gene-environment interactions underlying complex traits poses one of the great challenges to today’s genetic study and precision medicine. A versatile machine learning method, the generalized multifactor dimensionality reduction (GMDR), is proposed to hunt for multifactor interactions. It is applicable to a breadth of phenotypes, such as continuous, count, dichotomous, polytomous nominal, ordinal, survival, and multivariate, and to a variety of study designs, such as unrelated, family-based, and pooled unrelated and family samples. A GMDR software package is developed to implement this series of GMDR analyses for various scenarios. The package runs on various platforms, including MS Windows, Linux, and Mac, and provides two forms of user-friendly interface: a graphical user interface (GUI) and a command line interface (CLI). The GUI offers an integrated environment with a series of self-explanatory and easy-to-follow options, while the CLI provides an alternative means for advanced users to automate repetitive tasks and perform large-scale analyses more efficiently via scripting. A proportional odds model-based GMDR was applied to the genetic analysis of low-density lipoprotein (LDL) cholesterol, a causal risk factor for coronary artery disease, in the Multi-Ethnic Study of Atherosclerosis, and a significant joint action of the CELSR2, SERPINA12, HPGD, and APOB genes was identified. In conclusion, GMDR provides a practicable solution to problems in the detection of interactions.

February 13, 2019 – 11:00am-12:00pm CTRB 3161/3162
Haben Michael
Faculty Candidate, CDCC
Post-doctoral Research Associate, Department of Statistics
The Wharton School, University of Pennsylvania

Title: A Simple Weighted Approach for Instrumental Variable Estimation of Marginal Structural Mean Models

Abstract: Robins (1998) introduced marginal structural models (MSMs), a general class of counterfactual models for the joint effects of time-varying treatment regimes in complex longitudinal studies subject to time-varying confounding. He established identification of MSM parameters under a sequential randomization assumption (SRA), which rules out unmeasured confounding of treatment assignment over time. We consider sufficient conditions for identification of the parameters of a subclass, Marginal Structural Mean Models (MSMMs), when sequential randomization fails to hold due to unmeasured confounding, using instead a time-varying instrumental variable. We propose several identifying conditions, describe a simple weighted estimator, and examine its finite-sample properties in a simulation study.
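As background for the "simple weighted estimator" theme, the basic inverse-probability-weighting idea that MSM estimation builds on can be sketched for a single time point with a measured confounder. This is a textbook-style illustration under the standard no-unmeasured-confounding setting, not the talk's instrumental-variable estimator; the data-generating values (true effect 2, logistic propensity) are invented.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
L = rng.normal(size=n)                    # measured confounder
pA = 1.0 / (1.0 + np.exp(-L))             # treatment probability given L
A = rng.random(n) < pA                    # confounded treatment assignment
Y = 2.0 * A + L + rng.normal(size=n)      # outcome; true treatment effect = 2

# Weighted estimator: reweight each subject by the inverse probability of
# the treatment actually received, so the weighted sample mimics a trial
# in which treatment is independent of L
w = np.where(A, 1.0 / pA, 1.0 / (1.0 - pA))
ate = np.average(Y[A], weights=w[A]) - np.average(Y[~A], weights=w[~A])
```

The naive unweighted difference in means would be biased upward here (treated subjects have larger L, and L raises Y); the weighted contrast recovers the effect. The talk's contribution is what to do when a confounder like L is unmeasured and only an instrument is available.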

February 22, 2019 – 11:00am-12:00pm CTRB 2161/2162
Saeed Hassanpour
Assistant Professor of Biomedical Data Science
Geisel School of Medicine at Dartmouth

Title: Machine Learning in the Era of Precision Health

Abstract: Recent advancements in machine learning, particularly in deep neural networks, have shown state-of-the-art results for text and image analysis and in some cases have even exceeded human performance. In the medical domain, machine learning methods are used to extract critical medical insights from various unstructured health-related data such as clinical notes, pathology slides, radiology images, and even patients’ social media data. These approaches can augment patient care using distilled insights and make large and hard-to-analyze data sources clinically actionable to foster precision health. In this talk, I will share some of the ongoing research projects in my lab that use machine learning, deep learning, and medical image analysis for medical diagnosis, prognosis, and screening.

March 15, 2019 – 11:00am-12:00pm CTRB 2161/2162
Rao Chaganty
Department of Mathematics and Statistics
Old Dominion University

Title: Copula models for analyzing longitudinal and familial data

Abstract: Health sciences research and clinical trials often track treatments on individual subjects or families, resulting in longitudinal or familial data. Statistical analysis of such data is straightforward when response variables are continuous, as the multivariate normal distribution can be used to model both potential predictors of the responses and the dependence between repeated measurements or within families. However, the restricted ranges of some discrete observations make longitudinal or clustered analyses challenging, as they lead to stringent constraints on some parameters in the multivariate probability distributions for these discrete outcomes. Copula functions provide a powerful alternative since they separate the dependence modeling from the marginal distributions, and hence from any restrictions imposed by the range of the outcome, alleviating some analytical difficulties posed by traditional methods. In this talk, I will give a short introduction to copulas and, through motivating examples, illustrate the use of these models in analyzing data that occur in health sciences and clinical trials.
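The separation of dependence from margins that copulas provide can be shown with a minimal Gaussian-copula sketch. The margins (two Bernoulli outcomes) and the correlation value are invented for illustration; any other margins could be plugged into the same dependent uniforms.

```python
import math
import numpy as np

rng = np.random.default_rng(4)
n, d, rho = 20000, 2, 0.6

# Gaussian copula: draw correlated normals, push them through the standard
# normal CDF to get dependent Uniform(0,1) variables
cov = np.full((d, d), rho)
np.fill_diagonal(cov, 1.0)
zmat = rng.multivariate_normal(np.zeros(d), cov, size=n)
u = 0.5 * (1.0 + np.vectorize(math.erf)(zmat / math.sqrt(2.0)))

# Apply each inverse marginal CDF; Bernoulli(0.3) and Bernoulli(0.7) margins
# here, so the dependence comes entirely from the copula, not the margins
x1 = (u[:, 0] < 0.3).astype(int)
x2 = (u[:, 1] < 0.7).astype(int)
```

The resulting binary pair keeps its intended marginal rates while inheriting a positive association from the latent correlation, exactly the decoupling that makes copulas attractive for discrete longitudinal outcomes, where direct multivariate constructions impose awkward parameter constraints.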

March 21, 2019 – 1:30pm-2:30pm CTRB 2161/2162
Donna LaLonde
Director, Strategic Initiatives and Outreach
American Statistical Association

Title: ASA: Statistically Significant with YOUR Help

Abstract: The mission of the American Statistical Association (ASA) is to promote the practice and profession of statistics. ASA members, who are in academia, government, research, and business, provide leadership and service for the chapters and sections. The association’s activities are focused on aiding the professional development of our members, supporting statistics and data science education at all levels, engaging the public and media to raise awareness about the profession, and advocating for data-driven decision making and policy. This seminar will describe the ASA initiatives and programs and the opportunities for joining the community and contributing to the mission.

April 12, 2019 – 11:00am-12:00pm CTRB 2161/2162
Audrey Hendricks
Assistant Professor
Department of Statistics
University of Colorado

Title: Methods to Improve the use of Common Controls in Sequencing Studies

Abstract: Uncovering the functional mechanism of genetic associations across different variants, environmental and genetic backgrounds, tissues, and phenotypes is challenging and requires a large amount of resources. Additionally, gathering and completing adequately sized and powered studies of complex diseases to identify novel genetic associations is expensive. As such, wise allocation of limited resources is essential. Large genetic resources such as the Genome Aggregation Database (gnomAD, ~140,000 sequenced samples), NHLBI’s Trans-Omics for Precision Medicine (TOPMed) BRAVO interface (>60,000 genomes), and NHGRI’s Genome Sequencing Program (GSP, ~20,000 genomes for use as controls) have the potential to be used as controls in studies of complex diseases. However, differences between internal and external data will exist, creating a large potential for confounding and an increase in type I and type II errors, so robust methods are needed to use these data appropriately. Additionally, potential common control samples are often heterogeneous with respect to ancestry and may contain individuals with common diseases (e.g., Type 2 Diabetes). This heterogeneity can further confound or decrease power to detect association with case status. Here, I present two approaches to improve the use of common controls from sequencing studies. First, the Proxy External Controls Association Test (ProxECAT) tests for association of a genetic region with case status while controlling for differences in sequencing data generation. Second, identifying hidden ancestries estimates the proportions of global ancestry groups within summary genetic data. Both methods utilize publicly available frequency-level data, enabling broad and efficient use of external data and increasing our ability to detect genetic associations while adequately controlling for confounding.

April 19, 2019, 11:00am-12:00pm CTRB 3161/3162
Leo Duan
Assistant Professor
Department of Statistics
University of Florida

Title: High Dimensional Multi-view Clustering with Uncertainty Quantification

Abstract: High dimensional data often contain multiple facets, and several clustering patterns (views) can co-exist under different feature subspaces. While multi-view clustering algorithms have been proposed, uncertainty quantification remains difficult; a particular challenge lies in the high complexity of estimating the cluster assignment probability under each view, and/or in efficiently sharing information across views. In this article, we propose an empirical Bayes approach: viewing the similarity matrices generated over subspaces as rough first-stage estimates of the co-assignment probabilities, we obtain within their Kullback-Leibler neighborhood a refined low-rank soft cluster graph, formed by the pairwise product of simplex coordinates. Interestingly, each simplex coordinate directly encodes the cluster assignment uncertainty. For multi-view clustering, we equip each similarity matrix with a mixed membership over a small number of latent views, leading to effective dimension reduction. With high model flexibility, the estimation can be succinctly re-parameterized as a continuous optimization problem and hence enjoys gradient-based computation. Theory establishes the connection of this model to a random cluster graph under multiple views. Compared to single-view clustering approaches, substantially more interpretable results are obtained when clustering brains from a human traumatic brain injury study using high-dimensional gene expression data.
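The "pairwise product of simplex coordinates" construction from the abstract has a direct numerical reading: if each object carries a soft assignment vector on the probability simplex, the inner products of those vectors form a low-rank co-assignment matrix. A small sketch (the sizes and the Dirichlet draws are invented stand-ins for fitted quantities):

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 6, 3
# Simplex coordinates: row i holds object i's soft assignment over k clusters
W = rng.dirichlet(np.ones(k), size=n)

# Low-rank soft cluster graph: the co-assignment probability of objects i
# and j is the inner product of their simplex coordinate rows
P = W @ W.T

# Each simplex row directly encodes assignment uncertainty, e.g. via its
# entropy (near 0 for a confident assignment, near log(k) for a vague one)
entropy = -(W * np.log(W + 1e-12)).sum(axis=1)
```

Because P factors through the n x k matrix W, it has rank at most k, which is what makes the "low-rank soft cluster graph" cheap to estimate relative to a free n x n probability matrix.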