Roger B. Newson
Department of Primary Care and Public Health, Imperial College London
The Clinical Practice Research Datalink (CPRD) is a centrally managed data warehouse, storing data provided by the primary-care sector of the United Kingdom (UK) National Health Service (NHS). Medical researchers request retrievals from this database, and each retrieval takes the form of a collection of text datasets, whose format can be complicated. I have written a flagship package cprdutil, with multiple modules to input into Stata the many text dataset types provided in a CPRD retrieval. These text datasets may be converted either to Stata value labels or to Stata datasets, which can be created complete with value labels, variable labels, and numeric Stata dates. I have also written a fleet of satellite packages, to input into Stata the text datasets for retrievals of linked data, in which data are provided from non-CPRD sources, with CPRD identifier variables as a foreign key to allow data linkage. The modules of cprdutil are introduced. A demonstration example is given, in which a minimal CPRD database is produced in Stata, using cprdutil, and some principles of sensible programming practice for creating large databases are illustrated.
CNR-IRCrES, National Research Council of Italy
Matching is a popular estimator of the Average Treatment Effects (ATEs) within counterfactual observational studies. In recent years, however, many scholars have questioned the validity of this approach for causal inference, as its reliability draws heavily upon the so-called selection-on-observables assumption.
When unobservable confounders are possibly at work, they say, it becomes hard to trust matching results, and the analyst should consider alternative methods suitable for tackling unobservable selection. Unfortunately, these alternatives require extra information that may be costly to obtain, or even inaccessible.
For this reason, some scholars have proposed matching sensitivity tests for the possible presence of unobservable selection. The literature sets out two methods: the Rosenbaum (1987) and the Ichino, Mealli, and Nannicini (2008) tests. Both are implemented in Stata.
In this work, I propose a third and different sensitivity test for unobservable selection in matching estimation, based on a ‘leave-covariates-out’ (LCO) approach. Rooted in the machine learning literature, this sensitivity test resembles a bootstrap over different subsets of covariates and simulates various estimation scenarios to be compared with the baseline matching estimated by the analyst.
Finally, I will present sensimatch, the Stata routine I developed to run this method, and provide some instructional applications on real datasets.
Department of Geography, Durham University
Spaghetti plots show many tangled lines (say for multiple time series or other functional traces) which are hard to distinguish and interpret. Paella plots show multiple point patterns for many groups, sufficiently mixed up that comparisons are made difficult. The talk surveys several tactics and strategies for better, friendlier comparisons. Devices range from showing data several times over to selection, smoothing and transformation.
Alexandra Blenkinsop and Babak Choodari-Oskooei
MRC Clinical Trials Unit at UCL, London
Multi-arm multi-stage (MAMS) adaptive clinical trials offer several practical advantages over traditional two-arm designs. The framework proposed by Royston et al. (2011) uses intermediate outcomes at interim analyses to drop research arms demonstrating insufficient benefit prior to the final analysis on the primary outcome. To our knowledge, the nstage program developed for Stata (Barthel, Royston and Parmar, 2009) is the only sample size software for MAMS trials with time-to-event outcomes, a common outcome measure in modern trials in cancer, cardiovascular disease and other disease areas. We present an update to nstage to increase the efficiency and uptake of MAMS designs.
nstage can accommodate efficacy stopping boundaries at interim analyses with a new option. Users choose a stopping rule and the program estimates the operating characteristics for a design which can assess for early evidence of overwhelming efficacy on the primary outcome when interim analyses for lack-of-benefit occur on an intermediate outcome. The user specifies whether the trial is expected to terminate, or continue with the remaining arms, should an efficacious research arm be identified before the final analysis of the trial. Since the probability of a type I error is increased through such a design, the updated program offers an option to search for a design which strongly controls the maximum familywise error rate at the desired level, if it is required.
The program estimates the operating characteristics of the chosen design within a reasonable time-frame, allowing users to compare trial designs for different input parameters easily. We illustrate how the updates can be used to design a trial with the drop-down menu, using the MAMS trial STAMPEDE as an example. We hope the new functionality of the program will serve a broader range of trial objectives and, as such, increase adoption of the design in practice.
MRC Clinical Trials Unit at UCL
Meta-analysis (MA) is a statistical technique for combining results from multiple independent studies, with the aim of estimating a single overall effect with a size, direction and precision consistent with the data. Traditionally, MA is performed on aggregated data (AD), where each observation represents the effect observed in a study, often derived from study publications. The user-written command metan (Harris et al. 2008) is by far the most popular Stata command for performing AD MA, but it was last updated in 2010 and has various flaws and limitations.
The alternative to AD MA is to obtain and analyse individual participant data (IPD), where the totality of data from all studies are stacked to form a single, large dataset. I have previously described (Fisher 2015) a user-written command, ipdmetan, which facilitates so-called ‘two-stage’ IPD MA. The two stages are fitting a given model to the data from each study in turn and combining the results using AD techniques. The second stage, performed using the AD command admetan, has now been expanded into a fully comprehensive AD MA command, with all the functionality of metan and much more besides. The co-author and maintainer of metan, Ross Harris, has confirmed to me that he is no longer in a position to maintain it and is happy for admetan to take its place.
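As a rough illustration of the second stage, the sketch below pools per-study estimates using fixed-effect inverse-variance weights. It is not the admetan implementation: the function name and the three example estimates are hypothetical, and random-effects pooling is not shown.

```python
import math

def inverse_variance_pool(estimates, std_errors):
    """Fixed-effect (inverse-variance) pooled estimate and its standard error."""
    # Each study is weighted by the reciprocal of its sampling variance.
    weights = [1.0 / se ** 2 for se in std_errors]
    total = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)

# Hypothetical stage-one output: one log hazard ratio and standard error per study.
log_hr = [-0.20, -0.35, -0.10]
se = [0.10, 0.20, 0.15]

pooled, pooled_se = inverse_variance_pool(log_hr, se)
print(f"pooled log HR = {pooled:.3f} (SE {pooled_se:.3f})")
```

More precise studies (smaller standard errors) pull the pooled estimate towards their own result, which is why the first stage must report a standard error alongside each estimate.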
Another important aspect of ipdmetan (and hence also admetan) is its forest plot capabilities. Not only is the forest plot engine much more efficient and capable of better plots ‘out of the box’ when compared with metan; it also allows the user to save and edit ‘forestplot results sets’ which are interpreted directly by the stand-alone program forestplot to produce fully flexible plots.
Fisher, D.J. 2015. Two-stage individual participant data meta-analysis and generalized forest plots. Stata Journal 15: 369–396.
Harris, R.J., Deeks, J.D., Altman, D.G., Bradburn, M.J., Harbord, R.M., Sterne, J.A.C. 2008. metan: fixed- and random-effects meta-analysis. Stata Journal 8: 3–28.
Biostatistics Research Group, Department of Health Sciences, University of Leicester
merlin can do a lot of things: from linear regression to a Weibull survival model, from a three-level logistic model to a multivariate joint model of multiple longitudinal outcomes, a recurrent event and survival. merlin can do things I haven’t even thought of yet. I’ll take a single dataset, and attempt to show you the full range of capabilities of merlin, and talk about some of the new features following its rise from the ashes of megenreg. There’ll even be some surprises.
This presentation will discuss some popular supervised and unsupervised machine learning algorithms, and their recommended use, and then present implementations in Stata. The emphasis is on prediction and causal inference, and how to tailor a method to a specific application.
University of Exeter Business School
Daniel C. Schneider
Max Planck Institute for Demographic Research
Autoregressive distributed lag (ARDL) models are often used to analyse dynamic relationships with time series data in a single-equation framework. The current value of the dependent variable is allowed to depend on its own past realisations – the autoregressive part – as well as current and past values of additional explanatory variables – the distributed lag part. The variables can be stationary, nonstationary, or a mixture of the two types. In its equilibrium correction (EC) representation, the ARDL model can be used to separate the long-run and short-run effects, and to test for cointegration or, more generally, for the existence of a long-run relationship among the variables of interest.
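To make the link between the ARDL and EC representations concrete, the sketch below uses the simplest case, an ARDL(1,1) model y_t = c + a*y_{t-1} + b0*x_t + b1*x_{t-1} + e_t, and rewrites it as dy_t = c - (1 - a)*(y_{t-1} - theta*x_{t-1}) + b0*dx_t + e_t with long-run coefficient theta = (b0 + b1)/(1 - a). The function names and coefficient values are hypothetical, and the ardl command may parameterise its EC output differently.

```python
def ardl11_to_ec(a, b0, b1):
    """Map ARDL(1,1) coefficients to (adjustment, long_run, short_run)."""
    if abs(a) >= 1:
        raise ValueError("need |a| < 1 for a stable long-run relationship")
    adjustment = -(1.0 - a)            # speed of adjustment towards equilibrium
    long_run = (b0 + b1) / (1.0 - a)   # long-run effect of x on y
    return adjustment, long_run, b0    # b0 is the short-run (impact) effect

def step_ardl(y_lag, x, x_lag, c, a, b0, b1):
    """One noise-free step of the ARDL(1,1) form."""
    return c + a * y_lag + b0 * x + b1 * x_lag

def step_ec(y_lag, x, x_lag, c, a, b0, b1):
    """The same step written in equilibrium-correction form."""
    adjustment, long_run, short_run = ardl11_to_ec(a, b0, b1)
    return y_lag + c + adjustment * (y_lag - long_run * x_lag) + short_run * (x - x_lag)

# Hypothetical coefficients.
adjustment, long_run, short_run = ardl11_to_ec(a=0.5, b0=0.3, b1=0.2)
print(adjustment, long_run, short_run)
```

Both step functions return the same value for any inputs, which shows that the EC form is only a reparameterisation: it separates the short-run response from the gradual correction of deviations from the long-run relationship.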
This talk serves as a tutorial for the ardl Stata command that can be used to estimate an ARDL or EC model with the optimal number of lags based on the Akaike or Schwarz/Bayesian information criteria. Frequently asked questions will be addressed and step-by-step instructions for the Pesaran, Shin, and Smith (2001 Journal of Applied Econometrics) bounds test for the existence of a long-run relationship will be provided. This test is implemented as the postestimation command estat ectest which features newly computed finite-sample critical values and approximate p-values. These critical values cover a wide range of model configurations and supersede previous tabulations available in the literature. They account for the sample size, the chosen lag order, the number of explanatory variables, and the choice of unrestricted or restricted deterministic model components.
The ardl command uses Stata’s regress command to estimate the model. As a consequence, specification tests can be carried out with the standard postestimation commands for linear (time series) regressions and the forecast command suite can be used to obtain dynamic forecasts.
Christopher F Baum
Universidad del Rosario
We estimate response surface coefficients for a large range of quantiles of the Leybourne and Taylor (2003, Journal of Time Series Analysis 24: 441–460) test for the presence of seasonal unit roots. This test statistic offers power gains over the familiar regression-based approach advocated by Hylleberg, Engle, Granger and Yoo (1990, Journal of Econometrics 44: 215–238), which is currently implemented in Stata via the command sroot, developed by Depalo (2009, Stata Journal 9: 422–438), and the further extensions introduced by the command hegy by del Barrio Castro, Bodnar and Sansó (2016, Stata Journal 16: 740–760). The main feature of the Leybourne and Taylor test is that it achieves power gains through the use of forward and reverse HEGY regressions. The estimated response surfaces allow for different combinations of number of observations T and lag order in the test regressions p, where the latter can be either specified by the user or endogenously determined by the underlying data. The critical values depend on the method used to select the number of lags. We introduce the new Stata command ltur and illustrate its use with an empirical example. The new command permits the computation of the Leybourne and Taylor test statistics along with their associated critical values and approximate probability values.
Economic and Social Research Institute, Dublin
Christian B Hansen
University of Chicago Booth School of Business
Mark E Schaffer
Heriot-Watt University, Edinburgh
The field of machine learning is attracting increasing attention among social scientists and economists. At the same time, Stata offers to date only a very limited set of machine learning tools. This one-hour session introduces two Stata packages, lassopack and pdslasso, which implement regularized regression methods, including but not limited to the lasso (Tibshirani 1996 Journal of the Royal Statistical Society Series B), for Stata. The packages include features intended for prediction, model selection and causal inference, and are thus applicable in a wide range of settings. The commands allow for high-dimensional models, where the number of regressors may be large or even exceed the number of observations under the assumption of sparsity.
The package lassopack implements lasso, square-root lasso (Belloni et al. 2011 Biometrika; 2014 Annals of Statistics), elastic net (Zou and Hastie 2005 Journal of the Royal Statistical Society Series B), ridge regression (Hoerl and Kennard 1970 Technometrics), adaptive lasso (Zou 2006 Journal of the American Statistical Association) and post-estimation OLS. These methods rely on tuning parameters, which determine the degree and type of penalization. lassopack supports three approaches for selecting these tuning parameters: information criteria (implemented in lasso2), K-fold and h-step ahead rolling cross-validation (cvlasso), and theory-driven penalization (rlasso) due to Belloni et al. (2012 Econometrica). In addition, rlasso implements the Chernozhukov et al. (2013 Annals of Statistics) sup-score test of joint significance of the regressors.
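The role of the tuning parameter can be seen in the one case where the lasso has a closed form. Assuming orthonormal regressors (X'X = I) and the objective 0.5*||y - Xb||^2 + lam*||b||_1, each lasso coefficient is the OLS coefficient shrunk towards zero by lam and set exactly to zero if it is smaller than lam in absolute value. The sketch below is a textbook illustration under that assumption, with hypothetical function names and values; it is not how lassopack computes its estimates.

```python
def soft_threshold(b, lam):
    """Shrink b towards zero by lam; return 0.0 if |b| <= lam."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def lasso_orthonormal(ols_coefs, lam):
    """Lasso solution for orthonormal regressors, given the OLS coefficients."""
    return [soft_threshold(b, lam) for b in ols_coefs]

# Hypothetical OLS coefficients; a larger penalty selects a sparser model.
ols = [3.0, -0.5, 1.0]
for lam in (0.0, 0.25, 1.0):
    print(lam, lasso_orthonormal(ols, lam))
```

The approaches listed above (information criteria, cross-validation, theory-driven penalization) are different ways of choosing lam, and hence of choosing how many coefficients survive.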
The package pdslasso offers methods to facilitate causal inference in structural models. The package implements methods for selecting control variables (pdslasso) and/or instruments (ivlasso) from a large set of variables in a setting where the researcher is interested in estimating the causal impact of one or more (possibly endogenous) causal variables of interest. pdslasso and ivlasso rely on the lasso and square-root lasso estimators implemented in lassopack. ivlasso also supports weak-identification-robust hypothesis tests and confidence sets.
StataCorp, College Station, TX
Stata 15 introduced the new estimation command menl for fitting nonlinear mixed-effects models, also known as nonlinear multilevel models and nonlinear hierarchical models. These models can be thought of in two ways: as nonlinear models containing random effects or as linear mixed-effects models in which some or all fixed and random effects enter nonlinearly. The overall error distribution is assumed to be Gaussian. Nonlinear mixed-effects models have been used to model drug absorption in the body, intensity of earthquakes, and growth of plants, to name a few.
In my presentation, I will demonstrate how to use the menl command to fit nonlinear mixed-effects models in a variety of applications, including population pharmacokinetics and macroeconomics.
Paul C. Lambert
Biostatistics Research Group, Department of Health Sciences, University of Leicester; Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm
In observational studies with time-to-event outcomes, we expect that there will be confounding and would usually adjust for these confounders in a survival model. From such models an adjusted hazard ratio comparing exposed and unexposed subjects is often reported. This is fine, but hazard ratios can be difficult to interpret, are not collapsible, and there are further problems when trying to interpret hazard ratios as causal effects. Risks are much easier to interpret than rates and so quantifying the difference on the survival scale can be desirable.
In Stata, stcurve gives survival curves after fitting a model where certain covariates can be given specific values, but those not specified are given mean values. Thus it gives a prediction for an individual who happens to have the mean values of each covariate and may not reflect the average in the population. An alternative is to use standardization to estimate marginal effects, where the regression model is used to predict the survival curve for unexposed and exposed subjects at all combinations of other covariates included in the model. These predictions are then averaged to give marginal effects.
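The difference between a prediction at the mean of the covariates and a standardized (marginal) prediction can be sketched as follows. The example assumes a hypothetical exponential proportional-hazards model with one binary exposure and one confounder z; the coefficients and the small covariate sample are invented for illustration and do not come from any fitted model.

```python
import math

def survival(t, exposed, z, h0=0.1, b_exp=0.7, b_z=0.5):
    """S(t | exposed, z) under a hypothetical exponential PH model."""
    return math.exp(-h0 * math.exp(b_exp * exposed + b_z * z) * t)

def standardized_survival(t, exposed, z_values):
    """Average the predicted survival over the observed confounder values."""
    return sum(survival(t, exposed, z) for z in z_values) / len(z_values)

z_sample = [-1.0, 0.0, 0.0, 1.0, 2.0]   # hypothetical confounder values
t = 5.0

# Prediction for one individual who has the mean confounder value.
z_mean = sum(z_sample) / len(z_sample)
at_mean = survival(t, 1, z_mean)

# Standardized prediction: everyone set to exposed, then averaged.
marginal = standardized_survival(t, 1, z_sample)

# Marginal contrast: exposed minus unexposed, over the same confounder distribution.
difference = marginal - standardized_survival(t, 0, z_sample)

print(f"at mean: {at_mean:.4f}  standardized: {marginal:.4f}  difference: {difference:.4f}")
```

Because survival is a nonlinear function of the covariates, the two predictions differ; only the standardized one is an average over the population.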
I will describe a command, stpm2_standsurv, to obtain various standardized measures after fitting a flexible parametric survival model. As well as estimating standardized survival curves, the command can estimate the marginal hazard function, the standardized restricted mean survival time and centiles of the standardized survival curve. Contrasts can be made between any of these measures (differences, ratios). A user-defined function can be given for more complex contrasts.
MRC Biostatistics Unit, University of Cambridge
The command makehlp was released in July 2012. It simplifies the construction of a help file by generating an SMCL help template: the command opens the ado-file and produces a template help file from the syntax line. In the past the user would need to edit this template and fill in details such as the description, title, examples, etc. The new version of makehlp keeps the old functionality but also checks for the return codes to automatically produce a list of stored outputs. In addition, a new syntax is introduced so that all the necessary text can be included in the ado-file for the various sections, such as description, title, examples, author, references, see also, and all the option and return descriptions. An example of the new syntax is desc, which will place all the text between the brackets into the help file description, formatted as it is written, so SMCL commands are allowed. This means that the ado-file can store the majority of the help file, and the help file can subsequently be created from this ado-file.
Centre for Energy Economics Research and Policy, Heriot-Watt University, Edinburgh
The package multishell is intended to speed up simulations by making use of multi-core processors and Stata’s shell command. In a first step, one or multiple do files are converted into batch files and added to a queue. After starting the main command, the current instance of Stata acts as an organiser and works through the queue. It allocates the batch files to a pre-set number of parallel running Stata instances.
multishell has several distinct features. If do files include forvalues and foreach loops, multishell dissects the loops and creates a new do file for each combination, which is added to the queue. This allows for an efficient allocation and use of processor power. multishell can also be used to connect two or more computers to a cluster. multishell then allocates parts of the queue to each computer, and the simulation is run in parallel on multiple computers. Computational power is used efficiently and time is saved.
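The idea of dissecting nested loops into independent jobs can be sketched in a few lines. This is a conceptual illustration only, not multishell's code: the function names, loop variables and the round-robin allocation are assumptions made here, whereas the package itself works through a queue of batch files.

```python
from itertools import product

def expand_loops(loop_values):
    """Return one dict of loop-variable settings per combination of loop values."""
    names = list(loop_values)
    combos = product(*(loop_values[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]

def allocate(jobs, n_instances):
    """Round-robin split of the job list across parallel instances."""
    return [jobs[i::n_instances] for i in range(n_instances)]

# Hypothetical simulation design: two nested loops give 2 x 3 = 6 jobs.
jobs = expand_loops({"n": [50, 100], "rho": [0.1, 0.5, 0.9]})
for instance, batch in enumerate(allocate(jobs, 4)):
    print(instance, batch)
```

Each combination is independent of the others, so the jobs can run in any order and on any instance, which is what makes the loop-level split worthwhile.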
StataCorp, College Station, TX
Stepping back from Mata, and even stepping a little back from the book, I use its publication as an excuse to describe Mata, its features, and what programming in Mata can achieve.
StataCorp, College Station, TX
Latent class analysis (LCA) allows us to identify and understand unobserved groups in our data. These groups may be consumers with different buying preferences, adolescents with different patterns of behaviour, or different health status classifications.
Stata 15 introduced new features for performing LCA. In this presentation, I will demonstrate how to use gsem with categorical latent variables to fit standard latent class models – models that identify unobserved groups based on a set of categorical outcomes. I will also show how we can extend the standard model to include additional equations and to identify groups using continuous, count, ordinal, and even survival-time outcomes. We will use the results of these models to determine who is likely to be in a group and how that group’s characteristics differ from other groups.
Sarwar Islam Mozumder
Biostatistics Research Group, Department of Health Sciences, University of Leicester
In a typical survival analysis, the time to an event of interest is studied. For example, in cancer studies, researchers often wish to analyse a patient’s time from diagnosis to death. Similar applications also exist in economics and engineering. In such analyses, the event of interest is often not distinguished by cause. Although this may sometimes be useful, in many situations it does not paint the entire picture and restricts the analysis. More commonly, the event may occur due to different causes, which better reflects real-world scenarios. For instance, if the event of interest is death due to cancer, it is also possible for the patient to die due to other causes. This means that the time at which the patient would have died due to cancer is never observed. These are known as competing causes of death, or competing risks. In a competing risks analysis, interest lies in the cause-specific cumulative incidence function (CIF). This can be calculated by either (1) transforming (all) the cause-specific hazards, or (2) using a direct relationship with the subdistribution hazards.
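Approach (1) rests on the identity CIF_k(t) = integral from 0 to t of h_k(u) * S(u) du, where S(u) is the all-cause survival function implied by the sum of the cause-specific hazards. The sketch below evaluates this integral numerically for two hypothetical constant hazards and compares it with the closed form available in that special case. It illustrates the transformation only and is unrelated to the internals of stpm2cif.

```python
import math

def cif(hazards, cause, t, steps=10000):
    """Midpoint-rule approximation of CIF_cause(t) = integral of h_cause(u) * S(u) du."""
    du = t / steps
    cum_hazard = 0.0   # all-cause cumulative hazard at the left edge of the interval
    total = 0.0
    for i in range(steps):
        u = (i + 0.5) * du
        all_cause = sum(h(u) for h in hazards)
        # All-cause survival at the interval midpoint.
        surv_mid = math.exp(-(cum_hazard + 0.5 * all_cause * du))
        total += hazards[cause](u) * surv_mid * du
        cum_hazard += all_cause * du
    return total

# Hypothetical constant cause-specific hazards: cancer and other causes.
h_cancer, h_other = 0.10, 0.05
hazards = [lambda u: h_cancer, lambda u: h_other]

t = 5.0
numeric = cif(hazards, 0, t)
closed_form = h_cancer / (h_cancer + h_other) * (1 - math.exp(-(h_cancer + h_other) * t))
print(f"numeric: {numeric:.6f}  closed form: {closed_form:.6f}")
```

Note that the CIF for cancer depends on the hazard for other causes through S(u), which is why all cause-specific hazards must be modelled under approach (1).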
Obtaining cause-specific CIFs within the flexible parametric modelling framework by adopting approach (1) is possible by using the stpm2 post-estimation command, stpm2cif. Alternatively, since competing risks is a special case of a multi-state model, an equivalent model can be fitted using the multistate package. To estimate cause-specific CIFs using approach (2), stpm2 can be used by applying time-dependent censoring weights which are calculated on restructured data using stcrprep.
The above methods involve some form of data augmentation. Instead, estimation on individual-level data may be preferred due to computational advantages. This is possible using either approach, (1) or (2), with stpm2cr.
In this talk, an overview of these various tools is provided, followed by some discussion on which of these to use and when.
Université Libre de Bruxelles
In regression analysis it is well known that skewness and excessive tail heaviness affect the efficiency of classical estimators. In this work, we propose an estimator that is highly efficient for a wide range of distributions. More specifically, in accordance with standard Le Cam theory, we define a sign- and rank-based estimator of the regression coefficients as a one-step update, based on a fully semiparametrically efficient central sequence, of an initial root-n-consistent estimator.
In the central sequence, the score function, initially defined on the basis of the exact underlying innovation density f, is estimated using the fact that f can be well adjusted by a Tukey g-and-h distribution. We present the results of some Monte Carlo simulations conducted to assess the finite sample performance of our estimator, in comparison with the ordinary least squares estimator and the approximated maximum likelihood estimator. We propose a Stata command flexrank to implement it in practice. The procedure is very fast and has a low computational complexity.
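For readers unfamiliar with the Tukey g-and-h family, the sketch below shows its standard definition as a transformation of a standard normal variable Z: X = ((exp(g*Z) - 1)/g) * exp(h*Z^2/2), with the g = 0 case taken as the limit Z * exp(h*Z^2/2). Here g controls skewness and h controls tail heaviness. Location and scale parameters are omitted, the parameter values are hypothetical, and this is not the flexrank code.

```python
import math
import random

def tukey_gh(z, g, h):
    """Tukey g-and-h transform of a standard normal value z."""
    tail = math.exp(h * z * z / 2.0)   # h > 0 stretches the tails
    if g == 0.0:
        return z * tail                # limit of the skewing term as g -> 0
    return (math.exp(g * z) - 1.0) / g * tail

def sample_tukey_gh(n, g, h, seed=12345):
    """Draw n values by transforming standard normal draws."""
    rng = random.Random(seed)
    return [tukey_gh(rng.gauss(0.0, 1.0), g, h) for _ in range(n)]

# Hypothetical parameters: right-skewed (g > 0) with heavier tails (h > 0).
draws = sample_tukey_gh(10000, g=0.5, h=0.1)
print(min(draws), max(draws))
```

With g = h = 0 the transform returns Z unchanged, so the normal distribution is a special case, which is one reason the family can approximate a wide range of innovation densities.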