Department of Primary Care and Public Health, Imperial College London

r.newson@imperial.ac.uk

The Rubin method of confounder adjustment, in its 21st-century version, is a two-phase method for using observational data to estimate a causal treatment effect on an outcome variable. It involves first finding a propensity model in the joint distribution of a treatment variable and its confounders (the design phase), and then estimating the treatment effect from the

A good measure of this is Somers’ D(W|X), where W is a confounder or a propensity score, and X is the treatment variable. The SSC package somersd calculates Somers’ D for a wide range of sampling schemes, allowing matching and/or weighting and/or restriction to comparisons within strata. Somers’ D has the feature that, if Y is an outcome, then a higher-magnitude D(Y|X) cannot be secondary to a lower-magnitude D(W|X), implying that D(W|X) can be used to set an upper bound to the size of a spurious treatment effect on an outcome. For a binary treatment variable X, D(W|X) gives an upper bound to the size of a difference between the proportions, in the two treatment groups, that can be caused for a binary outcome. If D(W|X) is less than 0.5, then it can be doubled to give an upper bound to the size of a difference between the means, in the two treatment groups, that can be caused for an equal-variance Normal outcome, expressed in units of the common standard deviation for the two treatment groups.
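As a rough illustration of the quantities described above, the following Python sketch computes an unweighted, unstratified Somers' D(W|X) for a binary treatment indicator, and converts a value of D into the corresponding mean difference for two equal-variance Normal groups using the standard relation D = 2Φ(δ/√2) − 1. It is not the somersd package and handles none of its sampling schemes; the function names and the toy data are assumptions made for this example.

```python
from math import sqrt
from statistics import NormalDist


def somers_d_binary(w, x):
    """Somers' D(W|X) for a binary treatment indicator x (coded 0/1).

    Over all (treated, untreated) pairs, this is the proportion of pairs in
    which the treated subject has the higher w, minus the proportion in which
    it has the lower w. Pairs tied on w contribute zero.
    """
    treated = [wi for wi, xi in zip(w, x) if xi == 1]
    control = [wi for wi, xi in zip(w, x) if xi == 0]
    if not treated or not control:
        raise ValueError("both treatment groups must be non-empty")
    score = 0
    for wt in treated:
        for wc in control:
            score += (wt > wc) - (wt < wc)
    return score / (len(treated) * len(control))


def normal_mean_difference(d):
    """Mean difference, in units of the common SD, between two equal-variance
    Normal groups whose Somers' D equals d (inverts D = 2*Phi(delta/sqrt(2)) - 1)."""
    return sqrt(2) * NormalDist().inv_cdf((d + 1) / 2)


# Hypothetical toy data: x is the treatment, w a confounder or propensity score.
x = [0, 0, 0, 1, 1, 1]
w = [1, 2, 3, 2, 3, 4]
d = somers_d_binary(w, x)

# For a D below 0.5, the implied mean difference stays below 2 * D.
delta = normal_mean_difference(0.4)
print(round(d, 4), round(delta, 4), delta < 2 * 0.4)
```

The second function gives one way to check the doubling rule numerically: for values of D below 0.5, the implied standardized mean difference is smaller than 2D.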

We illustrate this method using a familiar dataset, with examples using propensity matching, weighting and stratification. We use the SSC package haif in the design phase, to check for variance inflation caused by propensity adjustment, and use the SSC package scenttest (an addition to the punaf family) to estimate the treatment effect in the analysis phase.

Additional information

newson_uksug16.pdf

newson_examples1.do

Department of Health Sciences, University of Leicester, Department of Medical Epidemiology and Biostatistics, Karolinska Institutet

michael.crowther@le.ac.uk

Multi-state models are increasingly being used to model complex disease profiles. By modelling transitions between disease states, accounting for competing events at each transition, we can gain a much richer understanding of patient trajectories and how risk factors impact over the entire disease pathway. In this talk, we will introduce some new Stata commands for the analysis of multi-state survival data. This includes msset, a data preparation tool which converts a dataset from wide (one observation per subject, multiple time and status variables) to long (one observation for each transition for which a subject is at risk for). We develop a new estimation command, stms, which allows the user to fit different parametric distributions for different transitions, simultaneously, whilst allowing sharing of covariate effects across transitions. Finally, predictms calculates transition probabilities, and many other useful measures of absolute risk, following the fit of any model using streg, stms, or stcox, using either a simulation approach or the Aalen-Johansen estimator. We illustrate the software using a dataset of patients with primary breast cancer.
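The wide-to-long conversion described above can be sketched for a simple illness-death model as follows. This is only an illustration of the data layout, written in Python rather than Stata; it does not reproduce the output of msset, and the field names (`ill_time`, `ill_status`, `death_time`, `death_status`) and the transition numbering are assumptions made for this example.

```python
def wide_to_long(subject):
    """Expand one wide illness-death record into one row per transition
    for which the subject is at risk.

    States: 1 = healthy, 2 = ill, 3 = dead.
    Transitions: 1 (healthy -> ill), 2 (healthy -> dead), 3 (ill -> dead).
    """
    became_ill = subject["ill_status"] == 1
    # Time at which the subject leaves the healthy state, or is censored in it.
    exit_healthy = subject["ill_time"] if became_ill else subject["death_time"]
    died_while_healthy = (not became_ill) and subject["death_status"] == 1

    rows = [
        {"id": subject["id"], "trans": 1, "start": 0.0,
         "stop": exit_healthy, "status": int(became_ill)},
        {"id": subject["id"], "trans": 2, "start": 0.0,
         "stop": exit_healthy, "status": int(died_while_healthy)},
    ]
    if became_ill:
        # Only subjects who reach the ill state are at risk of transition 3.
        rows.append({"id": subject["id"], "trans": 3,
                     "start": subject["ill_time"],
                     "stop": subject["death_time"],
                     "status": subject["death_status"]})
    return rows


# Hypothetical subjects: one becomes ill and then dies, one is censored while healthy.
ill_then_dead = {"id": 1, "ill_time": 2.0, "ill_status": 1,
                 "death_time": 5.0, "death_status": 1}
censored = {"id": 2, "ill_time": 4.0, "ill_status": 0,
            "death_time": 4.0, "death_status": 0}

long_rows = wide_to_long(ill_then_dead) + wide_to_long(censored)
for row in long_rows:
    print(row)
```

Each row carries its own start and stop time, so transitions out of a later state are analysed with delayed entry.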

crowther_uksug16.pdf

Center for Medical Biometry and Medical Informatics (IMBI), University of Freiburg Department of Mathematics and Computer Science, University of Southern Denmark

haghish@hotmail.com

The markdoc package is a minimal, lightweight literate programming package for Stata, yet it is highly flexible in the markup languages and output document formats that it supports. Although the package is mainly known for generating dynamic analysis documents, it was programmed primarily to provide a simple tool for teaching Stata, allowing students to actively document and interpret the analysis and results within Stata’s text editor. In this talk, I will discuss the educational potential of markdoc for both teachers and learners, and how it can be used in workshops and lab sessions to facilitate teaching and learning statistics.

University of Bern

ben.jann@soz.unibe.ch

At the 2009 meeting in Bonn I presented a new Stata command called texdoc. The command allowed weaving Stata code into a LaTeX document, but its functionality and its usefulness for larger projects were limited. In the meantime, I have heavily revised the texdoc command to simplify the workflow and improve support for complex documents. The command is now well suited, for example, to generate automatic documentation of data analyses or even to write an entire book. In this talk I will present the new features of texdoc and provide examples of their application.

jann_uksug16.pdf

jann_example1.pdf

jann_example2.pdf

Clinical Trials and Evaluation Unit, Bristol

Lauren.Scott@bristol.ac.uk

Clinical Trials and Evaluation Unit, Bristol

Chris.Rogers@bristol.ac.uk

In many fields of statistics, summary tables are used to describe characteristics within a study population. Moreover, such tables are often used to compare characteristics of two or more groups, for example, treatment groups in a clinical trial or different cohorts in an observational study. This talk introduces the sumtable command, a user-written command that can be used to produce such summary tables, allowing for different summary measures within one table. Summary measures available include means and standard deviations, medians and inter-quartile ranges, and numbers and percentages. The command removes any manual aspect of creating these tables (e.g. copying and pasting from the Stata output window) and therefore eliminates transposition errors. It also makes creating a summary table quick and easy and is especially useful if data are updated and tables subsequently need to change. The end result is an Excel spreadsheet that can be easily manipulated for reports or other documents. Although this command was written in the context of medical statistics, it would be equally useful in many other settings.

crowther_uksug16.pdf

Department of Economics, University of Essex

School of Economics, University of Surrey

jmcss@surrey.ac.uk

One of the main reasons for the popularity of panel data is that they make it possible to account for the presence of time-invariant unobserved individual characteristics, the so-called fixed effects. Consistent estimation of the fixed effects is only possible if the number of time periods is allowed to pass to infinity, a condition that is often unreasonable in practice. However, in a small number of cases, it is possible to find methods that allow consistent estimation of the remaining parameters of the model, even when the number of time periods is fixed. These methods are based on transformations of the problem that effectively eliminate the fixed effects from the model.

A drawback of these estimators is that they do not provide consistent estimates of the fixed effects and this limits the kind of inference that can be performed. For example, in linear models it is not possible to use the estimates obtained in this way to make predictions of the variate of interest. This problem is particularly acute in non-linear models where often the parameters have little meaning and it is more interesting to evaluate partial effects on quantities of interest.

In this presentation we show that, although it is indeed generally impossible to evaluate the partial effects at points of interest, it is sometimes possible to consistently estimate quantities that are informative and easy to interpret. The problem will be discussed using Stata, centred on a new ado file for calculating the average logit elasticities.

StataCorp, College Station, TX

ddrukker@stata.com

Doctors and consultants want to know the effect of a covariate for a given covariate pattern. Policy analysts want to know a population-level effect of a covariate. I discuss how to estimate and interpret these effects using factor variables and margins.
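The distinction between the two kinds of effect can be sketched numerically. The Python example below uses a logistic model with made-up coefficients (all values are assumptions chosen for illustration, not estimates from any dataset) and contrasts the treatment effect at one covariate pattern with the effect averaged over a small hypothetical sample. It illustrates the idea only and is not the syntax of margins.

```python
from math import exp


def prob(treat, age, b0=-1.0, b_treat=1.0, b_age=0.05):
    """Predicted probability from a logistic model with hypothetical coefficients."""
    return 1.0 / (1.0 + exp(-(b0 + b_treat * treat + b_age * age)))


def effect_at(age):
    """Effect of treatment for one covariate pattern: a difference in
    predicted probabilities with age held at the given value."""
    return prob(1, age) - prob(0, age)


def average_effect(ages):
    """Population-level effect: the same difference, averaged over the
    observed covariate values of everyone in the sample."""
    return sum(effect_at(a) for a in ages) / len(ages)


ages = [20, 40, 60, 80]            # hypothetical sample
conditional = effect_at(40)        # what a doctor might ask about one patient
population = average_effect(ages)  # what a policy analyst might ask
print(round(conditional, 4), round(population, 4))
```

Because the model is nonlinear, the two numbers differ even though the treatment coefficient is the same for everyone.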

Boston College & DIW Berlin

baum@bc.edu

University of York

We model the time series of credit default swap (CDS) spreads on sovereign debt in the Eurozone, allowing for stochastic volatility and examining the effects of country-specific and systemic shocks. A weekly volatility series is produced from daily quotations on the CDS of 11 Eurozone countries for 2009–2010. Using Stata’s gmm command, we construct a highly nonlinear model of the evolution of realized volatility when subjected to both idiosyncratic and systemic shocks. Evaluation of the quality of the fit for the 24 moment conditions is produced by a Mata auxiliary routine. This model captures many of the features of these financial markets during a turbulent period in the recent history of the single currency. We find that systemic volatility shocks increase returns on “virtuous” borrowers’ CDS while reducing returns for the most troubled countries’ obligations.

baum_uksug16.pdf

Spatial Economics and Econometrics Centre, Heriot-Watt University, Edinburgh

jd219@hw.ac.uk

This presentation introduces a new Stata command, xtdcce, to estimate a dynamic common correlated effects model with heterogeneous coefficients. The estimation procedure mainly follows Chudik and Pesaran (2015); in addition, the common correlated effects estimator (Pesaran 2006) as well as the mean group (Pesaran and Smith 1995) and the pooled mean group estimator (Shin et al. 1999) are supported. Coefficients are allowed to be heterogeneous or homogeneous. Instrumental variable regressions and unbalanced panels are also supported. The Cross Sectional Dependence Test (CD Test) is automatically calculated and presented in the estimation output. Examples of empirical applications of all estimation methods mentioned above are given.

Chudik, A. and Pesaran, M.H. 2015. Large panel data models with cross-sectional dependence: A survey. In Baltagi, B.H. (ed.)

Pesaran, M. 2006. Estimation and inference in large heterogeneous panels with a multifactor error structure.

Pesaran, M.H. and Smith, R. 1995. Estimating long-run relationships from dynamic heterogeneous panels.

Shin, Y., Pesaran, M.H. and Smith, R.P. 1999. Pooled mean group estimation of dynamic heterogeneous panels.

ditzen_uksug16.pdf

School of Social and Community Medicine, University of Bristol

Rachael.Hughes@bristol.ac.uk

Department of Medical Statistics, London School of Hygiene and Tropical Medicine

School of Social and Community Medicine, University of Bristol

School of Social and Community Medicine, University of Bristol

Linear mixed-effects models are commonly used for the analysis of longitudinal biomarkers of disease. Taylor

Taylor, J., Cumberland, W. and Sy, J. 1994. A stochastic model for analysis of longitudinal AIDS data.

hughes_uksug16.pdf

Faculty of Health, Social Care and Education, Kingston and St George’s, London

robert.grant@sgul.kingston.ac.uk

Stata and Mata are very powerful and flexible for data processing and analysis, but there are some problems that can be solved faster or more easily by using a lower-level programming language. statacpp is a command that allows users to write a C++ program, have Stata add their data, matrices or globals into it, compile it to an executable program, run it, and return the results to Stata as more variables, matrices or globals in a do-file. The most important use cases are likely to be around big data and MapReduce (where data can be filtered and processed according to parameters from Stata, and reduced results passed into Stata) and machine learning (where existing powerful libraries such as TensorFlow can be utilised). Short examples will be shown of both these aspects. Future directions for development will also be outlined, in particular calling Stata from C++ (useful for real-time responsive analysis) and calling CUDA from Stata (useful for massively parallel processing on GPU chips).

Work in progress at https://github.com/robertgrant/statacpp

France

catherine.welch@ucl.ac.uk

Attrition is a potential source of bias in longitudinal studies. It occurs when participants drop out, and it is informative when the reason for dropping out is associated with the study outcome. However, this is impossible to check, since the data we would need to confirm informative attrition are missing. When data are missing at random (MAR), that is, when the probability of missingness is not associated with the missing values conditional on the observed data, one appropriate approach for handling missing data is multiple imputation (MI). However, when attrition results in the data being missing not at random (MNAR), the probability of missingness is associated with the missing values themselves, so we cannot use MI directly. An alternative approach is pattern mixture modelling, which specifies separately the distribution of the observed data, which we know, and the distribution of the missing data, which we do not know. We can estimate the missing data models, using observations about the data, and average the estimates of the two models using MI. Many longitudinal clinical trials have a monotone missing pattern (once participants drop out they do not return), which simplifies MI, and so use pattern mixture modelling as a sensitivity analysis. However, in observational studies, data are missing because of both non-response and attrition, which is a more complex setting for handling attrition than clinical trials.
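The difference between a monotone missing pattern and intermittent non-response can be made concrete with a small check. The Python sketch below is an illustration only; it represents a participant's repeated measurements as a list with `None` for a missing value, which is an assumption made for this example.

```python
def is_monotone(measurements):
    """Return True if the missing-data pattern is monotone: once a value is
    missing (None), every later value is missing as well."""
    seen_missing = False
    for value in measurements:
        if value is None:
            seen_missing = True
        elif seen_missing:
            # An observed value after a missing one: the participant returned.
            return False
    return True


# Hypothetical participants measured at four phases.
dropout = [1.2, 0.8, None, None]        # attrition: never returns
intermittent = [1.2, None, 0.9, None]   # non-response at phase 2, then returns

print(is_monotone(dropout), is_monotone(intermittent))
```

In a clinical trial most participants resemble the first pattern, whereas an observational cohort typically contains both.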

For this study, we used data from the Whitehall II study. Data were first collected on over 10,000 civil servants in 1985, and data collection phases are repeated every 2-3 years. Participants complete a health and lifestyle questionnaire and, at alternate, odd-numbered phases, attend a screening clinic.

Over 30 years, many epidemiological studies have used these data. One study investigated how smoking status at baseline (Phase 5) was associated with 10-year cognitive decline using a mixed model with random intercept and slope. In these analyses, the authors replaced missing values in non-responders with last observed values. However, participants with reduced cognitive function may be unable to continue participation in the Whitehall II study, which may bias the statistical analysis.

Using Stata, we will simulate 1,000 datasets with the same distributions and associations as Whitehall II to perform the statistical analysis described above. First, we will develop a MAR missingness mechanism (conditional on previously observed values) and change cognitive function values to missing. Next, for attrition, we will use a MNAR missingness mechanism (conditional on measurements at the same phase). For both MAR and MNAR missingness mechanisms, we will compare the bias and precision from an analysis of simulated datasets without any missing data to a complete case analysis and an analysis of data imputed using MI and, additionally for the MNAR missingness mechanism, we will use pattern mixture modelling. We will use the two-fold fully conditional specification (FCS) algorithm to impute missing values for non-responders and to average estimates when using pattern mixture modelling. The two-fold FCS algorithm imputes each phase sequentially, conditional on observed information at adjacent phases, so it is a suitable approach for imputing missing values in longitudinal data. The user-written package for this approach, twofold, is available on the Statistical Software Components (SSC) archive. We will present the methods used to perform the study and results from these comparisons.

welch_uksug16.pdf

University of Exeter Business School

s.kripfganz@exeter.ac.uk

In this presentation, I discuss the new Stata command xtdpdqml that implements the unconditional quasi-maximum likelihood estimators of Bhargava and Sargan (1983,

The marginal distribution of the initial observations is modelled as a function of the observed variables to circumvent a short-T dynamic panel data bias. Robust standard errors are available following the arguments of Hayakawa and Pesaran (2015,

kripfganz_uksug16.pdf

Department of Geography, Durham University

n.j.cox@durham.ac.uk

Quantile plots show ordered values (raw data, estimates, residuals, whatever) against rank or cumulative probability or a one-to-one function of the same. Even in a strict sense, they are almost 200 years old. In Stata, quantile, qqplot, and qnorm go back to 1985 and 1986. So why any fuss?

The presentation is built on a long-considered view that quantile plots are the best single plot for univariate distributions. No other kind of plot shows so many features so well across a range of sample sizes with so few arbitrary decisions. Both official and user-written programs appear in a review that includes side-by-side and superimposed comparisons of quantiles for different groups and comparable variables. Emphasis is on newer, previously unpublished work, with focus on the compatibility of quantiles with transformations; fitting and testing of brand-name distributions; quantile-box plots as proposed by Emanuel Parzen (1929–2016); equivalents for ordinal categorical data; and the question of which graphics best support paired and two-sample t and other tests.
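The basic construction of a quantile plot, ordered values against cumulative probability, can be sketched in a few lines. The Python example below pairs each ordered value with a plotting position of the form (i − a)/(n − 2a + 1); the default a = 0.5, which gives (i − 0.5)/n, is one common choice and is an assumption of this example rather than a statement about any particular command.

```python
def quantile_plot_points(values, a=0.5):
    """Return (cumulative probability, ordered value) pairs for a quantile plot.

    The plotting position for the i-th smallest of n values is
    (i - a) / (n - 2a + 1); a = 0.5 gives (i - 0.5) / n.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((i - a) / (n - 2 * a + 1), v)
            for i, v in enumerate(ordered, start=1)]


# Hypothetical data; every value appears on the plot, with no binning choices.
points = quantile_plot_points([3, 1, 2])
for p, v in points:
    print(round(p, 4), v)
```

Plotting the second element of each pair against the first gives the quantile plot; replacing the probabilities by a one-to-one function of them (for example, Normal quantiles) gives the related plots mentioned above.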

Commands mentioned include distplot, multqplot, and qplot (

Cox, N.J. 1999a. Distribution function plots.

2005. The protean quantile plot.

2007. Quantile-quantile plots without programming. Stata Journal 7: 275–279.

2012. Axis practice, or what goes where on a graph.

cox_uksug16.pptx

Institut de Recherches Économiques et Sociales, Université catholique de Louvain

sebastien.fontenay@uclouvain.be

SDMX, which stands for Statistical Data and Metadata eXchange, is a standard developed by seven international organisations (BIS, ECB, Eurostat, IMF, OECD, the United Nations and the World Bank) to facilitate the exchange of statistical data (https://sdmx.org/). The package sdmxuse aims at helping Stata users to download SDMX data directly within their favourite software. The program builds and sends a query to the statistical agency (using RESTful web services), then imports and formats the downloaded dataset (in XML format). Some initiatives, notably the SDMX connector by Attilio Mattiocco at the Bank of Italy (https://github.com/amattioc/SDMX), have already been implemented to facilitate the use of SDMX data for external users but they all rely on the Java programming language. Formatting the data directly within Stata has proved to be quicker for large datasets but it also offers a simpler way for users to address potential bugs. The last argument is of particular importance for a standard that is evolving relatively fast.

The presentation will include an explanation of the functioning of the sdmxuse program, as well as an illustration of its usefulness in the context of macroeconomic forecasting. Since the seminal work of Stock and Watson (2002), factor models have become widely used to compute early estimates (now-casting) of macroeconomic series (e.g. Gross Domestic Product). More recent works (e.g. Angelini et al. 2011) have shown that regressions on factors extracted from a large panel of time series outperform traditional bridge equations. But this trend has increased the need for datasets with many time series (often more than one hundred) that are updated immediately after new releases are made available (i.e. almost daily). The package sdmxuse should be of interest for users wanting to work on the development of such models.

Angelini, E., Camba-Mendez, G., Giannone, D., Reichlin, L. and Rünstler, G. 2011. Short-term forecasts of euro area GDP growth.

Stock, J.H. and Watson, M.W. 2002. Forecasting using principal components from a large number of predictors.

fontenay_uksug16.pdf

StataCorp, College Station, TX

ymarchenko@stata.com

Joint modeling of longitudinal and survival-time data has been gaining more and more attention in recent years. Many studies collect both longitudinal and survival-time data. Longitudinal, panel, or repeated-measures data record data measured repeatedly at different time points. Survival-time or event history data record times to an event of interest such as death or onset of a disease. The longitudinal and survival-time outcomes are often related and should thus be analyzed jointly. Three types of joint analysis may be considered: 1) evaluation of the effects of time-dependent covariates on the survival time; 2) adjustment for informative dropout in the analysis of longitudinal data; and 3) joint assessment of the effects of baseline covariates on the two types of outcomes. In this presentation, I will provide a brief introduction to the methodology and demonstrate how to perform these three types of joint analysis in Stata.

Luxembourg Institute of Socio-Economic Research

philippe.vankerm@liser.lu

Incorporating covariates in (income or wage) distribution analysis typically involves estimating conditional distribution models, that is, models for the cumulative distribution of the outcome of interest conditionally on the value of a set of covariates. A simple strategy is to estimate a series of binary outcome regression models for F(z|xi) = Pr(yi ≤ z|xi) for a grid of values for z (Peracchi and Foresi,

Department of Health Sciences, University of Leicester

si113@leicester.ac.uk

Department of Health Sciences, University of Leicester

Department of Medical Epidemiology & Biostatistics, Karolinska Institutet, Stockholm

Department of Health Sciences, University of Leicester

Modelling within competing risks is increasing in prominence as researchers are becoming more interested in real-world probabilities of a patient’s risk of dying from a disease whilst also being at risk of dying from other causes. Interest lies in the cause-specific cumulative incidence function (CIF) which can be calculated by (1) transforming on the cause-specific hazards (CSH) or (2) through its direct relationship with the subdistribution hazards (SDH).
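The two routes to the cumulative incidence function can be illustrated numerically. The Python sketch below assumes two competing causes with constant cause-specific hazards (an assumption made purely to obtain a closed form to check against) and compares numerical integration of the overall survival function multiplied by the cause-specific hazard with that closed form. It also shows the direct relationship with the cumulative subdistribution hazard, CIF = 1 − exp(−cumulative SDH). The function names and hazard values are hypothetical.

```python
from math import exp


def cif_from_csh(t, h1, h2, steps=10000):
    """Route (1): CIF for cause 1 at time t, obtained by integrating
    S(u) * h1 over [0, t] with the midpoint rule, where
    S(u) = exp(-(h1 + h2) * u) is the overall survival function."""
    du = t / steps
    total = 0.0
    for j in range(steps):
        u = (j + 0.5) * du
        total += exp(-(h1 + h2) * u) * h1 * du
    return total


def cif_closed_form(t, h1, h2):
    """Closed form of the same integral when both hazards are constant."""
    return h1 / (h1 + h2) * (1.0 - exp(-(h1 + h2) * t))


def cif_from_sdh(cumulative_sdh):
    """Route (2): CIF obtained directly from the cumulative subdistribution hazard."""
    return 1.0 - exp(-cumulative_sdh)


# Hypothetical hazards: 0.10 per year for the disease, 0.05 for other causes.
numeric = cif_from_csh(5.0, 0.10, 0.05)
exact = cif_closed_form(5.0, 0.10, 0.05)
print(round(numeric, 4), round(exact, 4))
```

Route (1) needs the hazards for every cause, because the overall survival function depends on all of them, whereas route (2) needs only the subdistribution hazard for the cause of interest.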

We expand on current competing risks methodology within the flexible parametric survival modelling framework and focus on approach (2), which is more useful when we look to questions on prognosis. These can be parametrised through direct likelihood inference on the cause-specific CIF (Jeong and Fine 2006) which offers a number of advantages over the more popular Fine and Gray modelling approach (Fine and Gray 1999). Models have also been adapted for cure models using a similar approach described by Andersson et al. (2011) for flexible parametric relative survival models.

An estimation command, stpm2cr, has been written in Stata which is used to model all cause-specific CIFs simultaneously. Using SEER data, we compare and contrast our approach with standard methods and show that many useful out-of-sample predictions can be made after fitting a flexible parametric SDH model, for example, CIF ratios and CSH. Alternative link functions may also be incorporated such as the logit link leading to proportional odds models and models can be easily extended for time-dependent effects. We also show that an advantage of our approach is that it is less computationally intensive which is important particularly when analysing larger datasets.

Andersson, T.M-L., Dickman, P.W., Eloranta, S. and Lambert, P.C. 2011. Estimating and modelling cure in population-based cancer studies within the framework of flexible parametric survival models.

Fine, J.P. and Gray, R.J. 1999. A proportional hazards model for the subdistribution of a competing risk.

Jeong, J-H. and Fine, J.P. 2006. Direct parametric inference for the cumulative incidence function.

islam_uksug16.pdf

MRC Clinical Trials Unit at UCL

tim.morris@ucl.ac.uk

MRC Biostatistics Unit, Cambridge

ian.white@mrc-bsu.cam.ac.uk

University of Leicester

mjc76@leicester.ac.uk

Simulation studies are an invaluable tool for statistical research, particularly for the evaluation of a new method or comparison of competing methods. Simulations are well-used by methodologists but often conducted or reported poorly, and are underused by applied statisticians. It’s easy to execute a simulation study in Stata, but it’s at least as easy to do it wrong.

We will describe a systematic approach to getting it right, visiting:

- Types of simulation study
- An approach to planning yours
- Setting seeds and storing states
- Saving estimates with simulate and postfile
- Preparing for failed runs and trapping errors
- The three types of dataset involved in simulations
- Analysis of simulation studies
- Presentation of results (including Monte Carlo error)
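Several of the topics listed above (setting a seed, storing the generator state before each repetition, trapping failed runs, and reporting Monte Carlo error) can be sketched in a short, language-neutral way. The Python example below is an illustration under simple assumptions, not the tutorial's own code: the data-generating model, sample size and number of repetitions are all hypothetical, and the estimator is just a sample mean.

```python
import random
from math import sqrt
from statistics import mean, stdev


def one_repetition(rng, n_obs=50, true_mean=1.0):
    """Simulate one dataset and return the estimate (here, the sample mean)."""
    sample = [rng.gauss(true_mean, 1.0) for _ in range(n_obs)]
    return mean(sample)


def run_simulation(n_sim=200, seed=2016, true_mean=1.0):
    rng = random.Random(seed)   # set the seed once, at the start
    estimates, states, failures = [], [], 0
    for _ in range(n_sim):
        state = rng.getstate()  # store the state so this run can be replayed
        try:
            estimate = one_repetition(rng, true_mean=true_mean)
        except (ValueError, ZeroDivisionError):
            failures += 1       # record the failure instead of stopping
            continue
        estimates.append(estimate)
        states.append(state)
    bias = mean(estimates) - true_mean
    # Monte Carlo standard error of the estimated bias.
    mcse = stdev(estimates) / sqrt(len(estimates))
    return {"estimates": estimates, "states": states,
            "failures": failures, "bias": bias, "mcse": mcse}


result = run_simulation()
print(result["failures"], round(result["bias"], 4), round(result["mcse"], 4))

# Replay a single repetition from its stored state, e.g. to debug it.
replay_rng = random.Random()
replay_rng.setstate(result["states"][3])
replayed = one_repetition(replay_rng)
```

Storing the state before each repetition means that an individual run, such as one that failed or gave a surprising estimate, can be regenerated without rerunning the whole study.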

This tutorial will visit concepts, code, tips, tricks and potholes, with the aim of giving the uninitiated the necessary understanding to start tackling simulation studies.

MRC Clinical Trials Unit at UCL and London School of Hygiene and Tropical Medicine

s.cro@ucl.ac.uk

The statistical analysis of longitudinal randomised clinical trials is frequently complicated by the occurrence of protocol deviations which result in incomplete data sets for analysis. However one approaches analysis, an untestable assumption about the distribution of the unobserved post-deviation data must be made. In such circumstances it is important to assess the robustness of trial results from primary analysis to different credible assumptions about the distribution of the unobserved data.

Reference based multiple imputation procedures allow trialists to assess the impact of contextually relevant qualitative missing data assumptions (Carpenter, Roger and Kenward 2013). For example, in a trial of an active versus placebo treatment, missing data for active patients can be imputed following the distribution of the data in the placebo arm. I present the mimix command which implements the reference based multiple imputation procedures in Stata, enabling relevant accessible sensitivity analysis of trial data sets.

Carpenter, J.R., Roger, J.H. and Kenward, M.G. 2013. Analysis of longitudinal trials with protocol deviation: a framework for relevant, accessible assumptions, and inference via multiple imputation. Journal of Biopharmaceutical Statistics 23(6):1352-71.

cro_uksug16.pptx

Musculoskeletal Research Unit, University of Bristol

Adrian.Sayers@bristol.ac.uk

Parallel computing has promised to deliver faster computing for everyone using off-the-shelf multi-core computers. Despite proprietary implementation of new routines in Stata MP, the time required to conduct computationally intensive tasks such as bootstrapping, simulation and multiple imputation hasn’t dramatically improved.

One strategy to speed up computationally intensive tasks is to use distributed high performance computer clusters (HPC). Using HPCs to speed up computationally intensive tasks typically involves a divide and conquer approach. This simply divides repetitive tasks and distributes them across multiple processors and combines the results independently at the end of the process.
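The divide and conquer approach can be sketched as follows. This Python example splits a set of repetitions into near-equal chunks, runs the chunks concurrently, and combines the results at the end. It is an illustration of the general idea only and is not how qsub or any cluster scheduler is implemented; threads stand in for separate worker processes, and the per-repetition task (drawing one random number) is a placeholder.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_jobs(n_reps, n_workers):
    """Divide repetitions 0..n_reps-1 into n_workers contiguous chunks
    whose sizes differ by at most one."""
    base, extra = divmod(n_reps, n_workers)
    chunks, start = [], 0
    for k in range(n_workers):
        size = base + (1 if k < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def run_chunk(reps):
    """Run every repetition in one chunk. Each repetition is seeded by its
    own index, so the results do not depend on how the work was divided."""
    results = []
    for rep in reps:
        rng = random.Random(rep)
        results.append((rep, rng.random()))  # placeholder for a real task
    return results


def run_all(n_reps, n_workers):
    """Distribute the chunks across workers and combine the results."""
    chunks = split_jobs(n_reps, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(run_chunk, chunks))  # map preserves chunk order
    return [item for part in parts for item in part]


combined = run_all(10, 3)
print([len(chunk) for chunk in split_jobs(10, 3)], len(combined))
```

Seeding each repetition by its index is one design choice that keeps the combined results identical whether the work is run on one worker or many.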

The ability to access such clusters is limited; however, a similar system can be implemented on your desktop PC using the user-written command qsub.

qsub provides a wrapper which writes, submits and monitors jobs submitted to your desktop PC and which may dramatically improve the speed at which frequent computationally intensive tasks are completed.


Timberlake Consultants