Alicia Carriquiry, Iowa State University
“Statistics and the Fair Administration of Justice”

The emergence of DNA analysis as an effective forensic tool in the 1990s was a revelation, in that for the first time it was possible to quantify the degree of association between a crime scene sample and a suspect. It also had the effect of shining a light on other forensic practices, most of which lack the rigorous and widely accepted scientific foundations of DNA profiling and for which error rates are largely unknown. In the US criminal justice system, jurors choose between two competing hypotheses: the suspect is the source of the evidence found at the crime scene, or is not. We discuss a likelihood ratio framework for assessing the probative value of evidence, which relies on Bayes’ theorem and which, at least in principle, can be adapted to any type of evidence. We present two examples to illustrate its application, one using the chemical composition of glass fragments and the other using information about the surface topography of bullet lands.
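
In symbols, the framework rests on the odds form of Bayes’ theorem: writing $H_p$ for the hypothesis that the suspect is the source of the evidence $E$ and $H_d$ for the alternative,
\[
\frac{\Pr(H_p \mid E)}{\Pr(H_d \mid E)} \;=\; \underbrace{\frac{\Pr(E \mid H_p)}{\Pr(E \mid H_d)}}_{\text{likelihood ratio}} \times \frac{\Pr(H_p)}{\Pr(H_d)},
\]
so the likelihood ratio quantifies how strongly the evidence shifts the prior odds, whatever the type of evidence (this notation is a generic sketch of the framework, not the speaker’s specific formulation).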

 

Erin Schliep, University of Missouri
"Velocities for spatio-temporal processes"

We introduce the notion of velocity in an effort to expand inference for stochastic processes defined over space and time. For a realization of a stochastic process defined over a spatio-temporal domain, we can obtain the instantaneous gradient of the surface in time and space for a given location and time. The ratio of these two gradients can be interpreted as a velocity: the change in space in a given direction per unit time. With a Gaussian process realization for the spatio-temporal surface, we can obtain these gradients as directional derivative processes. The direction of the maximum gradient in space and the associated magnitude, which yield the direction and magnitude of minimum velocity, offer practical interpretations. Dimension reduction through predictive processes and sparsity through nearest neighbor Gaussian processes provide computational efficiency. We apply our method to two case studies. First, we specify a geostatistical model for average annual temperature across the eastern United States for the years 1963–2012. Estimates of the velocity of temperature change are compared across a collection of spatial locations and time points. Then, for a spatio-temporal point pattern of theft events in San Francisco in 2012, we specify a log Gaussian Cox process model to explain the events. We estimate the velocity of the point pattern, where the magnitude and direction of the minimum velocity provide the slowest rate and direction of movement required to maintain a constant chance of an event.
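
As a rough sketch of the quantity involved (our notation, not necessarily the speaker’s), for a spatio-temporal surface $Y(s,t)$ and a unit direction $u$, the velocity in direction $u$ is the ratio of the temporal gradient to the spatial directional derivative,
\[
v_u(s,t) \;=\; -\,\frac{\partial Y(s,t)/\partial t}{\nabla_s Y(s,t) \cdot u},
\]
and the minimum speed needed to keep the surface value constant, $|\partial Y/\partial t|\,/\,\lVert \nabla_s Y(s,t) \rVert$, is attained in the direction of the maximum spatial gradient.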

Dr. Erin Schliep is an Assistant Professor in the Department of Statistics at the University of Missouri. She received her PhD at Colorado State University in 2013. Prior to starting at Missouri, she spent two years at Duke University as a postdoctoral fellow. Dr. Schliep's research focuses on developing statistical methodology to further the understanding of environmental processes. Her work often entails methods for dependent data, with emphasis on spatio-temporal and multivariate data.

 

Jing Li, Arizona State University
"A Novel Positive Transfer Learning Model for Telemonitoring of Parkinson’s Disease"

When learning a new skill, people can transfer their knowledge about other related skills they have already grasped to expedite the learning. This extraordinary human ability has inspired the development of a class of statistical machine learning models called Transfer Learning (TL). When building a predictive model in a target domain, TL integrates data from that domain with knowledge transferred from other related source domains to mitigate the limited sample size of the target domain. TL provides an ideal framework for Precision Medicine, in which patient-specific models are needed so that diagnosis and treatment can be customized to each patient’s unique characteristics. However, each patient has only limited data due to time or resource constraints. Transfer learning from other patients with a similar disease offers the possibility of building a robust model.
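
As a hedged illustration of the general idea (a generic "shrink toward the source" estimator, not the PTL model presented in the talk), the following minimal Python sketch fits a patient-specific regression whose coefficients are pulled toward an estimate borrowed from related source patients; the hypothetical tuning parameter lam controls how much is transferred.

```python
import numpy as np

def transfer_ridge(X_t, y_t, beta_source, lam):
    """Target-domain fit shrunk toward a source-domain coefficient estimate.

    Solves min_b ||y_t - X_t b||^2 + lam * ||b - beta_source||^2.
    lam = 0 gives the target-only least-squares fit; large lam transfers
    heavily from the source domain. Illustrative only: it has no guard
    against negative transfer, which is the point of the PTL model.
    """
    p = X_t.shape[1]
    A = X_t.T @ X_t + lam * np.eye(p)
    b = X_t.T @ y_t + lam * beta_source
    return np.linalg.solve(A, b)

# toy usage: a small target sample borrows strength from a source estimate
rng = np.random.default_rng(0)
beta_true = np.array([1.0, -2.0, 0.5])
X_t = rng.normal(size=(15, 3))                        # few target observations
y_t = X_t @ beta_true + rng.normal(size=15)
beta_source = beta_true + 0.1 * rng.normal(size=3)    # estimate from related patients
print(transfer_ridge(X_t, y_t, beta_source, lam=5.0))
```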

In this talk, I focus on presenting a Positive Transfer Learning (PTL) model developed to enable patient-specific telemonitoring of Parkinson’s Disease (PD) using remote sensing devices or smartphones. An important problem that the existing TL literature has overlooked is “negative transfer,” the situation in which a TL model performs worse than a model built without transferring from any source domain. We provide a theoretical study of the risk of negative transfer, which further motivates the development of the PTL model that is robust to negative transfer. Telemonitoring is an emerging platform in health care that uses smart sensors to remotely monitor patient conditions. It provides logistic convenience and cost-effectiveness, allowing for close monitoring of disease progression and timely medical decisions. I will present an application of PTL in telemonitoring of PD.

At the end of the talk, I will briefly present our developments of TL models in other healthcare applications, including modality-wise missing data imputation for Alzheimer’s Disease early detection and learning discriminant subgraph classifiers for migraine diagnosis.

 

Ryan Martin, North Carolina State University
"Empirical priors and adaptive posterior concentration rates."

In high and infinite-dimensional problems, the Bayesian prior specification can be a challenge.  For example, in high-dimensional regression, while sparsity considerations drive the choice of prior on the model, there is no genuine prior information available about the coefficients in a given model.  Moreover, the choice of prior for the model-specific parameters impacts both the computational and theoretical performance of the posterior.  As an alternative, one might be tempted to choose a computationally simple "informative" empirical prior on the model-specific parameters, depending on data in a suitable way.  In this talk, I will present a new approach for empirical prior specification in high-dimensional problems, based on the idea of data-driven prior centering.  I will give (adaptive) concentration rate results for this new "empirical Bayes" posterior in several specific examples, with illustrations, and I will also say a few words about the general construction and corresponding theory.
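
For concreteness, one hedged example of data-driven prior centering in sparse linear regression (a sketch in the spirit of the talk, not necessarily the exact construction used) takes, for a candidate active set $S$,
\[
\beta_S \mid S \;\sim\; \mathrm{N}\!\bigl(\hat\beta_S,\; \gamma\,(X_S^\top X_S)^{-1}\bigr),
\]
where $\hat\beta_S$ is the least-squares estimate under model $S$ and $\gamma > 0$ is a tuning constant; because the data enter both the prior center and the likelihood, such constructions typically temper the likelihood (raise it to a power slightly below one) to keep the resulting "empirical Bayes" posterior well calibrated.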

 

Tatjana Miljkovic, Miami University of Ohio 
"Different approaches to loss modeling and their impact on risk measure assessment"

Key risk measures such as Value-at-Risk (VaR) and Conditional Tail Expectation (CTE) are important for capital allocation decisions, as they inform actuaries and risk managers about the degree to which a line of business or a company is exposed to a particular aspect of risk. These measures are typically estimated based on the best-fitting statistical model selected from a set of models considered for loss modeling. We propose two different approaches for finding this best-fitting model. The first approach is based on finite mixtures whose components belong to the same parametric distribution family. The second approach uses composite models in which two different distributions are used for the head and the tail and are combined in a smooth way at a specific threshold. In addition, we propose to estimate risk measures taking model uncertainty into account and show how model averaging can be used to obtain point estimates and their confidence intervals. Two popular data sets on Danish fire and Norwegian fire losses are used to illustrate the proposed methods.
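
To fix ideas, here is a minimal Python sketch of how the two risk measures are read off a fitted severity model, using a single lognormal fit as a stand-in for the mixture and composite models discussed in the talk (illustrative only).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
losses = stats.lognorm(s=1.2, scale=np.exp(7)).rvs(size=5000, random_state=rng)

# fit a simple severity model (a stand-in for the mixture/composite fits)
shape, loc, scale = stats.lognorm.fit(losses, floc=0)
fitted = stats.lognorm(shape, loc=loc, scale=scale)

alpha = 0.99
var_alpha = fitted.ppf(alpha)              # VaR: the alpha-quantile of the loss
# CTE: expected loss given that the loss exceeds VaR, E[X | X > VaR_alpha]
sims = fitted.rvs(size=200_000, random_state=rng)
cte_alpha = sims[sims > var_alpha].mean()
print(f"VaR({alpha}) = {var_alpha:,.0f}   CTE({alpha}) = {cte_alpha:,.0f}")
```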

 

Maria-Pia Victoria-Feser, University of Geneva, Switzerland
"Finite Sample Simulation Based Switched $Z$-estimation (SwiZs) and inference"

In this paper, we propose a class of simulation-based estimators that are, in general, numerically simple to implement and fast to compute, and we give suitable (and mild) conditions under which they deliver consistency, finite sample bias reduction, and accurate coverage probabilities for inference. This class can be used in complex settings, including high dimensional ones (e.g., $p$ large relative to $n$), with regularized (shrinkage) methods and with robust estimation and inference. The inferential framework is rooted in that of indirect inference, combined with Fisher's switching principle for inferential purposes. The links with other simulation-based inferential methods, such as the bootstrap and approximate Bayesian computation, are formally made and lead to the conclusion that the SwiZs brings clear advantages in terms of computational efficiency, bias reduction, and coverage probability with finite (and small) sample sizes. Moreover, the SwiZs outperforms asymptotic correction methods designed for the same purposes. We illustrate the theoretical results by means of exact derivations and simulations in complex settings.

Maria-Pia Victoria-Feser graduated from the University of Geneva (Ph.D. in econometrics and statistics) in 1993 and has since held positions in several institutions and departments. She was a lecturer in statistics at the London School of Economics (1993-1996), an assistant and then associate professor in statistics (part-time) at the Faculty of Psychology and Educational Sciences of the University of Geneva (1997-2005), financed by a Swiss National Science Foundation grant, and has been a full professor in statistics at the University of Geneva since 2001. She also worked toward the foundation of the Geneva School of Economics and Management (GSEM) of the University of Geneva, serving as its founding dean (2013-2017), and served as founding director of the Research Center for Statistics of the University of Geneva (created in 2011).

Maria-Pia Victoria-Feser's research interests are in fundamental statistics (robust statistics, model selection, and simulation-based inference in high dimensions for complex models) with applications in economics (welfare economics, extremes), psychology and social sciences (generalized linear latent variable models, media analytics), and engineering (time series for geo-localization). She has published in leading journals in statistics as well as in related fields.

 

Xiangrong Yin, University of Kentucky
"Expected Conditional Characteristic Function-based Measures for Testing Independence"

We propose a novel class of independence measures for testing independence between two random vectors based on the discrepancy between the conditional and the marginal characteristic functions. If one of the variables is categorical, our asymmetric index can be viewed as the between-group dispersion in a kernel ANOVA decomposition and leads to more powerful tests than those relying on symmetric measures. In addition, our index is also applicable when both variables are continuous. We develop two empirical estimates and obtain their respective asymptotic distributions. We illustrate the advantages of our approach in numerical studies across a variety of settings.
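
One way to formalize such a discrepancy (our notation, a sketch rather than the authors’ exact index) is
\[
\Delta(Y \mid X) \;=\; \mathbb{E}_X \int \bigl|\varphi_{Y \mid X}(t) - \varphi_Y(t)\bigr|^2\, w(t)\, dt,
\]
where $\varphi_{Y \mid X}$ and $\varphi_Y$ are the conditional and marginal characteristic functions of $Y$ and $w$ is a positive weight function; $\Delta(Y \mid X) = 0$ precisely when $Y$ is independent of $X$, and the asymmetric roles of $X$ and $Y$ are what the abstract refers to.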

 

Andrzej Ruszczynski, Rutgers University 
"
Risk Forms: Duality, Disintegration, and Statistical Estimation"

We introduce the concept of a risk form, which is a real functional on the product of two spaces: the space of measurable functions and the space of measures on a Polish space. We present a dual representation of risk forms and generalize the classical Kusuoka representation to this setting. For a risk form acting on a product space, we define marginal and conditional forms and we prove a disintegration formula, which represents a risk form as a composition of its marginal and conditional forms. We apply the proposed approach to two-stage optimization problems with partial information and decision-dependent observation distribution. Finally, we discuss statistical estimation of risk forms and present a central limit formula for a class of forms defined by nested expectations.

Andrzej Ruszczynski received his Ph.D. and habilitation degrees in control engineering from Warsaw University of Technology in 1976 and 1983, respectively. He has been with Warsaw University of Technology (Poland), University of Zurich (Switzerland), International Institute of Applied Systems Analysis (Laxenburg, Austria), Princeton University, University of Wisconsin-Madison, and Rutgers University. Dr. Ruszczynski is one of the creators of and main contributors to the field of risk-averse optimization, author of "Nonlinear Optimization" (Princeton University Press, 2006), co-author of "Lectures on Stochastic Programming" (SIAM, 2009), "Stochastic Programming" (Elsevier, 2003), and author of more than 100 articles in the area of optimization. He is the recipient of the 2018 Dantzig Prize of SIAM and the Mathematical Optimization Society, and an INFORMS Fellow.

 

Bailey Fosdick, Colorado State University
"Standard errors for regression on relational data with exchangeable errors"

Relational arrays represent interactions or associations between pairs of actors, often in varied contexts or over time. Such data appear as, for example, trade flows between countries, financial transactions between individuals, contact frequencies between school children in classrooms, and dynamic protein-protein interactions. In this talk, we propose and evaluate a new class of parameter standard errors for models that represent elements of a relational array as a linear function of observable covariates. Uncertainty estimates for regression coefficients must account for both heterogeneity across actors and dependence arising from relations involving the same actor. Existing estimators of parameter standard errors that recognize such relational dependence rely on estimating extremely complex, heterogeneous structure across actors. Leveraging an exchangeability assumption, we derive parsimonious standard error estimators that pool information across actors and are substantially more accurate than existing estimators in a variety of settings. This exchangeability assumption is pervasive in the network and array models in the statistics literature, but not previously considered when adjusting for dependence in a regression setting with relational data. We show that our estimator is consistent and demonstrate improvements in inference through simulation and a data set involving international trade.
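
In a hedged sketch of the general form (not the authors’ exact estimator), for a dyadic regression $y_{ij} = x_{ij}^\top \beta + \epsilon_{ij}$ the coefficient covariance has the sandwich form
\[
\mathrm{Var}(\hat\beta) \;=\; (X^\top X)^{-1} X^\top \Omega\, X\, (X^\top X)^{-1}, \qquad \Omega = \mathrm{Var}(\epsilon),
\]
and exchangeability of the errors across actors implies that $\Omega$ contains only a few distinct values (for pairs of relations sharing two, one, or no actors), so those few parameters can be estimated by pooling over all actor pairs instead of estimating a separate dependence structure for each actor.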

Dr. Bailey Fosdick is an Assistant Professor in the Department of Statistics at Colorado State University. She earned her Ph.D. at the University of Washington in 2013 and spent a year as a Postdoctoral Fellow at the Statistical and Applied Mathematical Sciences Institute. Dr. Fosdick’s research primarily focuses on developing methodology for the statistical analysis of social networks, motivated by pressing questions in population ecology, public health, political science, and sociology. Dr. Fosdick also works on methods for survey analysis and the analysis of multivariate data.

 

Ron Gallant, Pennsylvania State University 
"Bayesian Inference Using the EMM Objective Function With Application to the Dominant Generalized Blumenthal-Getoor Index of an Ito Semimartingale"

We modify the Gallant and Tauchen (1996) efficient method of moments (EMM) method to perform exact Bayesian inference, where exact means no reliance on asymptotic approximations.  We use this modification to evaluate the empirical plausibility of recent predictions from high-frequency financial theory regarding the small-time movements of an Ito semimartingale.  The theory indicates that the probability distribution of the small moves should be locally stable around the origin. It makes no predictions regarding large rare jumps, which get filtered out. Our exact Bayesian procedure imposes support conditions on parameters as implied by this theory.  The empirical application uses Index options extending over a wide range of moneyness, including deep out of the money puts. The evidence is consistent with a locally stable distribution valid over most of the support of the observed data while mildly failing in the extreme tails, about which the theory makes no prediction.  We undertake diagnostic checks on all aspects of the procedure.  In particular, we evaluate the distributional assumptions regarding a semi-pivotal statistic, and we test by Monte Carlo that the posterior distribution is properly centered with short credibility intervals.  Taken together, our results suggest a more important role than previously thought for pure jump-like models with diminished, if not absent, diffusive component. 

Ron Gallant is a Liberal Arts Professor of Economics at The Pennsylvania State University. Prior to joining the Penn State faculty, he was Hanes Corporation Foundation Professor of Business Administration, Fuqua School of Business, Duke University, with a secondary appointment in the Department of Economics, Duke University, and Distinguished Scientist in Residence, Department of Economics, New York University.

Before joining the Duke faculty, he was Henry A. Latane Distinguished Professor of Economics at the University of North Carolina at Chapel Hill.  He retains emeritus status at UNC.

Previous to UNC he was, successively, Assistant, Associate, Full, and Drexel Professor of Statistics and Economics at North Carolina State University. Gallant has held visiting positions at the l'Ecole Polytechnique, the University of Chicago, and Northwestern University.

He received his A.B. in mathematics from San Diego State University, his M.B.A. in marketing from the University of California at Los Angeles, and his Ph.D. in statistics from Iowa State University.  He is a Fellow of both the Econometric Society and the American Statistical Association.  He has served on the Board of Directors of the National Bureau of Economic Research, the Board of Directors of the American Statistical Association, and on the Board of Trustees of the National Institute of Statistical Sciences.  He is past co-editor of the Journal of Econometrics and past editor of The Journal of Business and Economic Statistics.

 

Bo Li, University of Illinois at Urbana-Champaign
"Spatially Varying Autoregressive Models for Prediction of New HIV Diagnoses"

Motivated by the need to predict new HIV diagnosis rates based on publicly available HIV data that are abundant in space but sparse in time, we propose a class of spatially varying autoregressive (SVAR) models compounded with conditional autoregressive (CAR) spatial correlation structures. We further propose to use a copula approach with a flexible CAR formulation to model the dependence between adjacent counties. These models allow for spatial and temporal correlation as well as space-time interactions and are naturally suitable for predicting HIV cases and other spatio-temporal disease data that feature a similar structure. We apply the proposed models to HIV data from Florida, California, and the New England states and compare them to a range of linear mixed models that have recently been popular for modeling spatio-temporal disease data. The results show that for such data our proposed models outperform the others in terms of prediction.
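
As an illustrative special case (our notation; covariates, the copula construction, and other details are omitted), a spatially varying autoregression for the rate $Y_t(s_i)$ in county $s_i$ and year $t$ might take the form
\[
Y_t(s_i) \;=\; \alpha(s_i) + \rho(s_i)\, Y_{t-1}(s_i) + \epsilon_t(s_i),
\]
with CAR priors on the county-level intercepts $\alpha(\cdot)$ and autoregressive coefficients $\rho(\cdot)$ so that neighboring counties share information, and with spatially correlated errors $\epsilon_t(\cdot)$.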

 

Shuheng Zhou, University of California, Riverside 
"Tensor models for large, complex and high dimensional data"

Building models and methods for large spatio-temporal data is important for many scientific and application areas that affect our lives. In this talk, I will discuss several interrelated yet distinct models and methods for graph and mean recovery problems, with applications in neuroscience, spatio-temporal modeling, and genomics.

In the first part, I discuss the Gemini methods for estimating the graphical structures and underlying parameters, namely the row and column covariance and inverse covariance matrices, from matrix variate data. Under sparsity conditions, we show that one is able to recover the graphs and covariance matrices from a single random matrix drawn from the matrix variate normal distribution. Our method extends, with suitable adaptation, to the general setting where replicates are available. We establish consistency and obtain the rates of convergence in the operator and Frobenius norms. We show that having replicates allows one to estimate more complicated graphical structures and achieve faster rates of convergence. We provide simulation evidence showing that we can recover the graphical structures as well as estimate the precision matrices, as predicted by theory.
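
In symbols (a sketch of the setting in our notation), a single matrix observation $X \in \mathbb{R}^{n \times p}$ from the matrix variate normal model satisfies
\[
\mathrm{vec}(X) \;\sim\; \mathrm{N}\bigl(0,\; A \otimes B\bigr),
\]
where the $p \times p$ matrix $A$ captures dependence among columns (variables) and the $n \times n$ matrix $B$ dependence among rows (observations); the two graphs correspond to the zero patterns of $A^{-1}$ and $B^{-1}$, and the goal is to recover both from the single matrix $X$ under sparsity.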

It has been proposed that complex populations, such as those that arise in genomics studies, may exhibit dependencies among observations as well as among variables. This gives rise to the challenging problem of analyzing high-dimensional data with unknown mean and dependence structures. In the second part of the talk, I  present a practical method utilizing generalized least squares and penalized (inverse) covariance estimation to address this challenge. We establish consistency and obtain rates of convergence for estimating the mean parameters and covariance matrices iteratively. We use simulation studies and analysis of genomic data from a twin study of ulcerative colitis to illustrate the statistical convergence and the performance of our methods in practical settings.

In the final part of the talk (time permitting), I will discuss a parsimonious model for precision matrices of matrix-normal data based on the Cartesian product of graphs. By enforcing extreme sparsity (the number of parameters) and explicit structures on the precision matrix, this model has excellent potential for improving the scalability of the computation and interpretability of complex data analysis. We establish consistency for both the Bi-graphical Lasso (BiGLasso) and Tensor Graphical Lasso (TeraLasso) estimators and obtain the rates of convergence for estimating the precision matrix.
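
In the two-way case, the Cartesian-product structure corresponds (in a sketch, using our notation) to a Kronecker-sum precision matrix
\[
\Omega \;=\; \Psi_1 \oplus \Psi_2 \;=\; \Psi_1 \otimes I + I \otimes \Psi_2,
\]
so the number of free parameters grows additively rather than multiplicatively in the dimensions, which is the source of the extreme sparsity and scalability noted above.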

This talk is based on joint work with Michael Hornstein, Roger Fan, Kerby Shedden, Kristjan Greenewald and Al Hero.

 

Po-Ling Loh, University of Wisconsin-Madison
"Mean estimation for entangled single-sample distributions"

We consider the problem of estimating the common mean of univariate data when independent samples are drawn from non-identical symmetric, unimodal distributions. This captures the setting where all samples are Gaussian with different unknown variances. We propose an estimator that adapts to the level of heterogeneity in the data, achieving near-optimality in both the i.i.d. setting and some heterogeneous settings, where the fraction of “low-noise” points is as small as log n / n. Our estimator is a hybrid of the modal interval, shorth, and median estimators from classical statistics. The rates depend on the percentile of the mixture distribution, making our estimators useful even for distributions with infinite variance.
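
As a hedged illustration of one of the classical ingredients mentioned (not the hybrid estimator from the talk itself), the following Python sketch computes a shorth-type location estimate, the midpoint of the shortest interval containing about half of the points, and compares it with the median on data where only a small fraction of points are low-noise.

```python
import numpy as np

def shorth_midpoint(x):
    """Midpoint of the shortest interval containing about half of the points.

    A classical 'shorth'-type location estimator; shown only to illustrate
    one building block (modal interval / shorth / median) of the hybrid
    estimator discussed in the talk.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    k = n // 2 + 1                             # the interval must cover k points
    widths = x[k - 1:] - x[: n - k + 1]        # width of every k-point window
    i = int(np.argmin(widths))
    return 0.5 * (x[i] + x[i + k - 1])

# toy data: every point has mean 0, but only a few are "low-noise"
rng = np.random.default_rng(2)
n = 2000
sigmas = np.where(rng.random(n) < 0.02, 0.1, 50.0)    # ~2% low-noise points
x = rng.normal(0.0, sigmas)
print("median:", np.median(x), " shorth midpoint:", shorth_midpoint(x))
```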

 

Jinbo Chen, University of Pennsylvania 
"A Novel Goodness-of-Fit Based Two-Phase Sampling Design for Studying Binary Outcomes"

In a biomedical cohort study for assessing the association between an outcome variable and a set of covariates, it is common that a subset of covariates can only be measured on a subgroup of study subjects. An important design question is which subjects to select into the subgroup to increase statistical efficiency for association analyses. When the outcome is binary, one may adopt a case-control sampling design or a balanced case-control design in which cases and controls are further matched on a small number of discrete covariates whose values are available for all subjects. While the latter achieves success in estimating odds ratio (OR) parameters for the matching covariates, to the best of our knowledge, similar two-phase design options have not been explored for increasing statistical efficiency for assessing the remaining covariates, particularly the incompletely collected ones. To this end, assuming that an external model is available relating the outcome and the complete covariates, we propose a novel sampling scheme that over-samples cases and controls with poorer goodness-of-fit based on the external model and, at the same time, matches cases and controls on the complete covariates similarly to the balanced design. We develop an accompanying pseudo-likelihood method for OR parameter estimation, which can be performed using existing software packages. Through extensive simulation studies and explorations in a real cohort study setting, we find that our design generally leads to a reduction in the asymptotic variances of the estimated OR parameters to a similar extent for both the incomplete and complete covariates, and that the reduction for the matching covariates is comparable to that of the balanced design.
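
The flavor of the selection step can be sketched as follows (a stylized illustration under assumed column names, not the authors’ exact sampling scheme or estimator): rank subjects by how poorly the external model predicts their outcome and, within strata of the complete covariates, retain the worst-fitting cases and controls for the expensive phase-two measurements.

```python
import numpy as np
import pandas as pd

def gof_based_subsample(df, m_per_stratum):
    """Stylized goodness-of-fit driven two-phase selection (illustration only).

    Assumes df has columns 'y' (binary outcome), 'p_ext' (predicted probability
    from an external model built on the complete covariates), and 'stratum'
    (a discrete complete covariate used for matching). Within each stratum,
    the m cases and m controls with the poorest fit (largest |y - p_ext|)
    are selected for measurement of the incomplete covariates.
    """
    df = df.assign(gof=np.abs(df["y"] - df["p_ext"]))
    picks = []
    for _, grp in df.groupby("stratum"):
        for y_val in (0, 1):
            sub = grp[grp["y"] == y_val]
            picks.append(sub.nlargest(min(m_per_stratum, len(sub)), "gof"))
    return pd.concat(picks)

# toy usage
rng = np.random.default_rng(3)
n = 1000
stratum = rng.integers(0, 4, size=n)
p_ext = rng.uniform(0.05, 0.6, size=n)
y = rng.binomial(1, p_ext)
cohort = pd.DataFrame({"y": y, "p_ext": p_ext, "stratum": stratum})
print(gof_based_subsample(cohort, m_per_stratum=25).shape)
```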

 

Wei Vivian Li, UCLA 
"Statistical Methods to Uncover Hidden Information from Large-scale Genomic Data"

Statistical genomics is an emerging field and has played a crucial role in discovering the genetic mechanisms behind complex biological phenomena. In this talk, I will discuss two statistical methods I have developed to uncover key hidden information from large-scale genomic data. First, I will introduce AIDE, a statistical method that selectively incorporates prior knowledge into the modeling to improve the statistical inference of missing RNA structures. AIDE is the first method that directly controls false RNA discoveries by implementing the statistical model selection principle. Second, I will introduce MSIQ, a statistical model for robust RNA quantification by integrating multiple biological samples under a Bayesian framework. MSIQ accounts for sample heterogeneity and achieves more accurate estimation of RNA quantities. Beyond these two methods, I will also summarize my other work in this area, and I will briefly introduce my ongoing work in asymmetric classification.

 

Jeffrey Regier, University of California, Berkeley 
"Statistical Inference for Cataloging the Visible Universe"

Jeffrey Regier is a postdoctoral researcher in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. He completed his Ph.D. in statistics at UC Berkeley in 2016. Previously, he received an MS in computer science from Columbia University (2005) and a BA in computer science from Swarthmore College (2003). His research focuses on statistical inference for large-scale scientific applications, including applications in astronomy and in genomics. He has received the Hyperion HPC Innovation Excellence Award (2017) and the Google Ph.D. Fellowship in Machine Learning (2013).

Abstract:  A key task in astronomy is to locate astronomical objects in images and to characterize them according to physical parameters such as brightness, color, and morphology. This task, known as cataloging, is challenging for several reasons: many astronomical objects are much dimmer than the sky background, labeled data is generally unavailable, overlapping astronomical objects must be resolved collectively, and the datasets are enormous -- terabytes now, petabytes soon. Existing approaches to cataloging are largely based on algorithmic software pipelines that lack an explicit inferential basis. In this talk, I present a new approach to cataloging based on inference in a fully specified probabilistic model. I consider two inference procedures: one based on variational inference (VI) and another based on MCMC. A distributed implementation of VI, written in Julia and run on a supercomputer, achieves petascale performance -- a first for any high-productivity programming language. The run is the largest-scale application of Bayesian inference reported to date. In an extension, using new ideas from variational autoencoders and deep learning, I avoid many of the traditional disadvantages of VI relative to MCMC, and improve model fit.

 

Walter Dempsey, Harvard University 
“Statistical network modeling via exchangeable interaction processes”

Many modern network datasets arise from processes of interactions in a population, such as phone calls, e-mail exchanges, co-authorships, and professional collaborations. In such interaction networks, the interactions comprise the fundamental statistical units, making a framework for interaction-labeled networks more appropriate for statistical analysis. In this talk, we present exchangeable interaction network models and explore their basic statistical properties. These models allow for sparsity and power law degree distributions, both of which are widely observed empirical network properties. I will start by presenting the Hollywood model, which is computationally tractable, admits a clear interpretation, exhibits good theoretical properties, and performs reasonably well in estimation and prediction.
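
A heavily hedged sketch of the flavor of such models (a degree-reinforced urn in the spirit of, but not identical to, the Hollywood model): the actors filling the roles of each new interaction are drawn with probability roughly proportional to their past degree, with some probability of introducing a brand-new actor, which is what produces sparse networks with power-law degree distributions.

```python
import numpy as np

def simulate_interactions(num_interactions, alpha=0.5, theta=5.0, seed=0):
    """Simulate interactions from a degree-reinforced urn (illustrative sketch).

    Each interaction has 2-4 roles. Each role is filled by a new actor with
    probability (theta + alpha * v) / (theta + total), where v is the number
    of actors seen so far and total is the number of roles filled so far;
    otherwise an existing actor is chosen with probability proportional to
    (degree - alpha). Parameter names and details are ours, not the talk's.
    """
    rng = np.random.default_rng(seed)
    degrees = []                # degrees[j] = roles filled by actor j so far
    total = 0                   # total roles filled so far
    interactions = []
    for _ in range(num_interactions):
        actors = []
        for _ in range(rng.integers(2, 5)):
            v = len(degrees)
            p_new = 1.0 if total == 0 else (theta + alpha * v) / (theta + total)
            if rng.random() < p_new:
                degrees.append(0)
                j = v
            else:
                probs = np.array(degrees) - alpha
                probs /= probs.sum()
                j = rng.choice(v, p=probs)
            degrees[j] += 1
            total += 1
            actors.append(j)
        interactions.append(actors)
    return interactions, np.array(degrees)

interactions, degrees = simulate_interactions(5000)
print("distinct actors:", len(degrees), " max degree:", degrees.max())
```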

In many settings, however, the series of interactions are structured. E-mail exchanges, for example, have a single sender and potentially multiple receivers. User posts on a social network such as a mobile health social support platform also have this structure.  I will introduce hierarchical exchangeable interaction models for the study of structured interaction networks. In particular, I will introduce an extension of the Hollywood model as the canonical model, which partially pools information via a latent, shared population-level distribution. A detailed simulation study and supporting theoretical analysis provide clear model interpretation, and establish global power-law degree distributions. A computationally tractable Gibbs sampling algorithm is derived. Inference will be shown on the Enron e-mail and ArXiv datasets.  I will end with a discussion of how to perform posterior predictive checks on interaction data. Using these proposed checks, I will show that the model fits both datasets well.

 

Yian Ma, University of California, Berkeley 
"
Bridging MCMC and Optimization"

In this talk, Yian will discuss three ingredients of optimization theory in the context of MCMC: non-convexity, acceleration, and stochasticity. He will focus on a class of non-convex objective functions arising from mixture models. For that class of objective functions, he will demonstrate that the computational complexity of a simple MCMC algorithm scales linearly with the model dimension, while the corresponding optimization problems are NP-hard.

He will then study MCMC algorithms as optimization over the KL-divergence in the space of measures. By incorporating a momentum variable, he will discuss an algorithm that performs accelerated gradient descent over the KL-divergence. Using optimization-like ideas, a suitable Lyapunov function is constructed to prove that an accelerated convergence rate is obtained.

Finally, he will present a complete recipe for constructing stochastic gradient MCMC algorithms that translates the task of finding a valid sampler into one of choosing two matrices. He will then describe how stochastic gradient MCMC algorithms can be applied to applications involving temporally correlated data, where the challenge arises from the need to break the dependencies when considering minibatches of observations.
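
For instance, the simplest member of that family, stochastic gradient Langevin dynamics, updates the parameters with a minibatch gradient estimate plus injected Gaussian noise; the sketch below is one well-known instance of the recipe, not the full framework from the talk.

```python
import numpy as np

def sgld(grad_log_post, theta0, data, n_iters=5000, batch_size=32,
         step_size=1e-3, seed=0):
    """Stochastic gradient Langevin dynamics (SGLD), a minimal sketch.

    grad_log_post(theta, batch, n_total) must return an unbiased estimate of
    the gradient of the log posterior from a minibatch (log-prior gradient
    plus (n_total / batch_size) times the minibatch log-likelihood gradient).
    The injected noise makes the iterates approximate posterior samples
    rather than a point estimate.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    n = len(data)
    samples = []
    for _ in range(n_iters):
        batch = data[rng.choice(n, size=batch_size, replace=False)]
        g = grad_log_post(theta, batch, n)
        theta = (theta + 0.5 * step_size * g
                 + rng.normal(scale=np.sqrt(step_size), size=theta.shape))
        samples.append(theta.copy())
    return np.array(samples)

# toy example: posterior for a Gaussian mean with a flat prior
rng = np.random.default_rng(4)
data = rng.normal(2.0, 1.0, size=1000)
grad = lambda th, batch, n: (n / len(batch)) * np.sum(batch - th)
samples = sgld(grad, theta0=[0.0], data=data)
print("posterior mean estimate:", samples[2000:].mean())
```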

Yian Ma is currently a post-doctoral fellow at the University of California, Berkeley, hosted by Michael I. Jordan at the Foundations of Data Analysis Institute and RISELab. Prior to that, he obtained his PhD from the applied mathematics department at the University of Washington, working with Emily B. Fox at the Mode Lab and with Hong Qian. Before that, he obtained his bachelor's degree from the department of computer science and engineering at Shanghai Jiao Tong University.

 

Andrés Felipe Barrientos, Duke University 
"
Bayesian nonparametric models for compositional data"

We propose Bayesian nonparametric procedures for density estimation for compositional data, i.e., data in the simplex space. To this aim, we propose prior distributions on probability measures based on modified classes of multivariate Bernstein polynomials. The resulting prior distributions are induced by mixtures of Dirichlet distributions, with random weights and a random number of components. Theoretical properties of the proposal are discussed, including large support and consistency of the posterior distribution. We use the proposed procedures to define latent models and apply them to data on employees of the U.S. federal government. Specifically, we model data comprising federal employees’ careers, i.e., the sequence of agencies where the employees have worked. Our modeling of the employees’ careers is part of a broader undertaking to create a synthetic dataset of the federal workforce. The synthetic dataset will facilitate access to relevant data for social science research while protecting subjects’ confidential information.

Bio: I am currently a Postdoctoral Associate at Duke University under the mentorship of Jerry Reiter. Before joining Duke, I received a grant from the Chilean National Fund for Scientific and Technological Development to work as a Postdoctoral Fellow at the Pontificia Universidad Católica de Chile, under the mentorship of Alejandro Jara. I completed my Ph.D. in Statistics at the Pontificia Universidad Católica de Chile under the supervision of Fernando Quintana (advisor) and Alejandro Jara (co-advisor). Before starting my Ph.D., I worked for Universidad del Valle, in Colombia (where I am originally from), as a faculty member in the School of Industrial Engineering and Statistics. I also earned my bachelor's degree in Statistics at Universidad del Valle.

 

Oscar Madrid Padilla, University of California, Berkeley 
"
Fused lasso in graph estimation problems"

The fused lasso, also known as (anisotropic) total variation denoising, is widely used for piecewise constant signal estimation with respect to a given undirected graph. In this talk I will describe theory and methods for the fused lasso. Two classes of problems will be discussed: denoising on graphs, and nonparametric regression on general metric spaces. For the first of these tasks, I will provide a general upper bound on the mean squared error of the fused lasso that depends on the sample size and the total variation of the underlying signal. I will show that this upper bound matches the minimax rate when the graph is a tree of bounded degree, and I will present a surrogate estimator that attains the same upper bound and can be computed in linear time. The second part of the talk will focus on extending the fused lasso to general nonparametric regression. The resulting approach, which we call the K-nearest neighbors (K-NN) fused lasso, involves (i) computing the K-NN graph of the design points and (ii) performing the fused lasso over this K-NN graph. I will discuss several theoretical advantages over competing approaches: specifically, the estimator inherits local adaptivity from its connection to the fused lasso, and it inherits manifold adaptivity from its connection to the K-NN approach. Finally, I will briefly mention some of my other research directions.
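
For reference, in standard notation the estimator being analyzed solves
\[
\hat\theta \;=\; \operatorname*{arg\,min}_{\theta \in \mathbb{R}^n}\; \tfrac{1}{2} \sum_{i=1}^n (y_i - \theta_i)^2 \;+\; \lambda \sum_{(i,j) \in E} |\theta_i - \theta_j|,
\]
where $E$ is the edge set of the given graph (the K-NN graph of the design points in the second part of the talk) and the total-variation penalty encourages $\hat\theta$ to be piecewise constant over the graph.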

 

Keegan Korthauer, Harvard T.H. Chan School of Public Health 
"Accurate inference of DNA methylation data: statistical challenges lead to biological insights"

DNA methylation is an epigenetic modification widely believed to act as a repressive signal of gene expression. Whether or not this signal is causal, however, is currently under debate. Recently, a groundbreaking experiment probed the influence of genome-wide promoter DNA methylation on transcription and concluded that it is generally insufficient to induce repression. However, the previous study did not make full use of statistical inference in identifying differentially methylated promoters. In this talk, I’ll detail the pressing statistical challenges in the area of DNA methylation sequencing analysis, as well as introduce a statistical method that overcomes these challenges to perform accurate inference. Using both Monte Carlo simulation and complementary experimental data, I’ll demonstrate that the inferential approach has improved sensitivity to detect regions enriched for downstream changes in gene expression while accurately controlling the False Discovery Rate. I will also highlight the utility of the method through a reanalysis of the landmark study of the causal role of DNA methylation. In contrast to the previous study, our results show that DNA methylation of thousands of promoters overwhelmingly represses gene expression.

 

Andee Kaplan, Duke University 
"Life After Record Linkage: Tackling the Downstream Task with Error Propagation"

Record linkage (entity resolution or de-duplication) is the process of merging noisy databases to remove duplicate entities that often lack a unique identifier. Linking data from multiple databases increases both the size and scope of a dataset, enabling post-processing tasks such as linear regression or capture-recapture to be performed. Any inferential or predictive task performed after linkage can be considered the "downstream task." While recent advances have been made to improve the flexibility and accuracy of record linkage, there are limitations in the downstream task due to the passage of errors through this two-step process. In this talk, I present a generalized framework for creating a representative dataset post-record linkage for the downstream task, called prototyping. Given the information about the representative records, I explore two downstream tasks: linear regression and binary classification via logistic regression. In addition, I discuss how error propagation occurs in both of these settings. I provide thorough empirical studies for the proposed methodology, and conclude with a discussion of practical insights into my work.

More about Andee: http://andeekaplan.com/