2017 London Stata Users Group Meeting - Presentation Abstracts

Ridit splines with applications to propensity weighting

Roger B. Newson
Department of Primary Care and Public Health, Imperial College London


Given a random variable X, the ridit function R_X(·) specifies its distribution. The SSC package wridit can compute ridits (possibly weighted) for a variable. A ridit spline in a variable X is a spline in the ridit R_X(X). The SSC package polyspline can be used with wridit to generate an unrestricted ridit-spline basis for an X-variable, with the feature that, in a regression model, the parameters corresponding to the basis variables are equal to mean values of the outcome variable at a list of percentiles of the X-variable. Ridit splines are especially useful in propensity weighting. The user may define a primary propensity score in the usual way, by fitting a regression model of the treatment variable with respect to the confounders, and then using the predicted values of the treatment variable. A secondary propensity score is then defined by regressing the treatment variable with respect to a ridit-spline basis in the primary propensity score. We have found that secondary propensity scores can predict the treatment variable as well as the corresponding primary propensity scores, as measured using the unweighted Somers' D with respect to the treatment variable. However, secondary propensity weights frequently perform better than primary propensity weights at standardizing out the treatment-propensity association, as measured using the propensity-weighted Somers' D with respect to the treatment variable. Also, when we measure the treatment effect, secondary propensity weights may cause considerably less variance inflation than primary propensity weights. This is because the secondary propensity score is less likely to produce extreme propensity weights than the primary propensity score.
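
As a rough illustration of what a ridit is, the sketch below computes unweighted ridits in Python (not Stata), assuming the standard Bross definition R_X(x) = P(X < x) + 0.5 P(X = x). The function name ridits is made up for this example, and nothing here is meant to reproduce the internals or options of wridit.

```python
from collections import Counter

def ridits(values):
    """Return the Bross ridit of each value: P(X < x) + 0.5 * P(X = x)."""
    n = len(values)
    counts = Counter(values)
    below = {}          # number of observations strictly below each distinct value
    running = 0
    for v in sorted(counts):
        below[v] = running
        running += counts[v]
    return [(below[v] + 0.5 * counts[v]) / n for v in values]

print(ridits([1, 2, 2, 3]))  # [0.125, 0.5, 0.5, 0.875]
```

Ridits lie strictly between 0 and 1 and average to 0.5 in the sample on which they are computed, which is what makes them a convenient scale on which to place spline knots at percentiles.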


Nonparametric synthetic control method for program evaluation: model and Stata implementation

Giovanni Cerulli
CNR-IRCrES, National Research Council of Italy

Building on the papers by Abadie and Gardeazabal (2003) and Abadie, Diamond, and Hainmueller (2010), I extend the synthetic control method for program evaluation to the case of nonparametric identification of the synthetic (or counterfactual) time pattern of the treated unit (for instance, a country, region, or city).

I discuss the advantages of this method over the one provided by previous authors and apply it to the same example as Abadie, Diamond, and Hainmueller (2010), that is, the study of the effects of Proposition 99, a large-scale tobacco control program that California implemented in 1988.

I will also show the use of the Stata command synth provided by Abadie, Diamond, and Hainmueller (2014) and of npsynth, the command I wrote to implement the nonparametric synthetic control method in Stata.

Given that many policy interventions and events of interest in social sciences take place at an aggregate level (countries, regions, cities, etc.) and affect a small number of aggregate units, the potential applicability of synthetic control methods to comparative case studies is very large, especially in situations where traditional regression methods are not appropriate.


A general multilevel estimation framework: Multivariate joint models and more

Michael J Crowther
Department of Health Sciences, University of Leicester

A tremendous amount of work has been conducted in the area of joint models in recent years, with new extensions constantly being developed as the methods become more widely accepted and utilised, especially as the availability of software increases. In this talk I will introduce work focused on developing an over-arching general framework, and a usable software implementation called megenreg, for estimating many different types of joint models. This will allow the user to fit a model with any number of outcomes, each of which can be of various types (continuous, binary, count, ordinal, survival), with any number of levels, and with any number of random effects at each level. Random effects can then be linked between outcomes in a number of ways. Of course, all of this is nothing new, and can be done (far better) with gsem. My focus, and my motivation for writing my own simplified/extended gsem, is to extend the modelling capabilities to allow the inclusion of the expected value of an outcome (possibly time-dependent), or its gradient, its integral, or a general function of it, in the linear predictor of another. Furthermore, I develop simple utility functions to allow the user to extend to non-standard distributions in an extremely simple way with a little Mata function, whilst still providing the complex syntax users of gsem will be familiar with. I’ll focus on a special case of the general framework, joint modelling of multivariate longitudinal outcomes and survival, and in particular discuss some of the challenges faced in estimating such complex models, such as high-dimensional random effects, and describe how we can relax the assumption of normally distributed random effects. I’ll also describe many new methodological extensions, particularly in the field of survival analysis, each of which is simple to implement in megenreg.


On the shoulders of giants, or not reinventing the wheel

Nicholas J. Cox
Durham University

Part of the art of coding is writing as little as possible to do as much as possible. The presentation expands on this truism. Examples are given of Stata code to yield graphs and tables in which most of the real work is happily delegated to workhorse commands.

In graphics a key principle is that graph twoway is the most general command, even when you do not want rectangular axes. Variations on scatter and line plots are precisely that, variations on scatter and line plots. More challenging illustrations include commands for circular and triangular graphics, in which x and y axes are omitted, with an inevitable but manageable cost in re-creating scaffolding, titles, labels and other elements.

In tabulations and listings the better known commands sometimes seem to fall short of what you want. However, a few preparation commands (such as generate, egen, collapse or contract) followed by list, tabdisp or _tab can get you a long way.

The examples range in scope from a few lines of interactive code to fully developed programs. The presentation is thus pitched at all levels of Stata users.


Scheme scheme, plot plot: DIY graph schemes in Stata

Tim Morris
MRC Clinical Trials Unit, University College London

Stata includes many options to change design elements of graphs. Invoking these may be necessary to satisfy corporate branding guidelines or journal formatting requirements, or desirable as a matter of personal taste. Whatever the reason, many options get used repeatedly (some in every graph), and the code required to produce a single publication-ready figure can run to tens of lines. Changing the scheme can reduce the number of options required. What many people are unaware of is that it is simple to write your own personal graph scheme, greatly reducing the number of lines of code needed for any given graph command. Opening a graph scheme file reveals how unintimidating it is to modify a scheme. This presentation encourages users to ‘scheme scheme, plot plot’, showing very simple and more complex examples, and the coding effort that this can save.


Unemployment duration and re-employment wages: a control function approach

Marta C. Lopes
Nova School of Business and Economics

In the context of the instrumental variables (IV) approach, the control function has been widely used in the applied econometrics literature. The main objective is the same as in other IV methods: to find (at least) one instrumental variable which explains the variation in the endogenous explanatory variable (EEV) of the structural equation. Once this goal is accomplished, the researcher should regress the EEV on the exogenous variables excluded from the structural equation (the instrumental variables). From this regression, usually called the first stage, one should obtain the generalized residuals and plug them into the structural equation (the second stage). These residuals then serve as a control function that transforms the EEV into an appropriate exogenous variable.
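
The two stages can be sketched in Python for the simplest case, where both stages are linear and the residuals are ordinary OLS residuals. This is only a toy illustration under those assumptions: the data and the helper names ols and cov are invented for the example, and it does not cover the censored or duration first stage that the talk addresses. In this linear special case the control function coefficient coincides with the simple IV estimate.

```python
def ols(y, columns):
    """Least-squares coefficients of y on the given columns (intercept first)."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [v - f * w for v, w in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

# Hypothetical data: z is the instrument, u an unobserved confounder.
z = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
u = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
x = [1 + 0.8 * zi + ui for zi, ui in zip(z, u)]      # endogenous regressor
y = [2 + 0.5 * xi + 3 * ui for xi, ui in zip(x, u)]  # structural equation, true slope 0.5

# First stage: regress x on z and keep the residuals.
a0, a1 = ols(x, [z])
v_hat = [xi - a0 - a1 * zi for xi, zi in zip(x, z)]

# Second stage: add the first-stage residuals as a control.
_, b_cf, _ = ols(y, [x, v_hat])

b_naive = ols(y, [x])[1]         # ignores endogeneity, biased upward here
b_iv = cov(z, y) / cov(z, x)     # simple IV estimate for comparison
print(b_naive, b_cf, b_iv)
```

Because the second stage uses an estimated regressor, its conventional standard errors are not valid, which is why bootstrapping both stages together is a common choice.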

The main advantage of this method is that, unlike the two-stage least squares approach (2SLS), it can be applied to non-linear models (Wooldridge, 2015). Such a situation arises when the outcome variable of the structural equation is discrete, truncated, or censored, for example. The estimation of a non-linear model, as opposed to the typical ordinary least squares regression (OLS), can also be required in the first stage. In this paper I provide an application to the latter case by estimating an accelerated failure time model to explain unemployment duration (my EEV).

To apply the control function to non-linear models, Stata currently offers only the command etregress, which allows for a binary treatment variable. To complement this option, I propose a user-written program which allows for a censored treatment variable. As the program is directed at duration models, the user will be able to choose the type of survival analysis to perform in the first stage. Because each stage is estimated separately, the program calculates bootstrapped standard errors for the second stage.


eltmle: Ensemble learning targeted maximum likelihood estimation

Miguel-Angel Luque Fernandez
London School of Hygiene and Tropical Medicine

Modern epidemiology has identified significant limitations of classic epidemiological methods, such as outcome regression analysis, when estimating causal quantities such as the average treatment effect (ATE) from observational data. For example, using classical regression models to estimate the ATE requires assuming that the effect measure is constant across levels of the confounders included in the model, i.e. that there is no effect modification. Other methods do not require this assumption, including g-methods (e.g. the g-formula) and targeted maximum likelihood estimation (TMLE). Many, but not all, estimators of the ATE rely on parametric modelling assumptions, so correct model specification is crucial for obtaining unbiased estimates of the true ATE. TMLE is a semiparametric, efficient substitution estimator that allows data-adaptive estimation while obtaining valid statistical inference based on targeted minimum loss-based estimation. Being doubly robust, TMLE allows the inclusion of machine learning algorithms to minimise the risk of model misspecification, a problem that persists for competing estimators. Evidence shows that TMLE typically provides the least biased estimates of the ATE compared with other doubly robust estimators. eltmle is a Stata program implementing targeted maximum likelihood estimation of the ATE for a binary outcome and binary treatment. eltmle includes the use of a super learner called from the Super Learner R package v.2.0-21 (Polley E., et al. 2011). The Super Learner uses V-fold cross-validation (10-fold by default) to assess the performance of prediction regarding the potential outcomes and the propensity score as weighted averages of a set of machine learning algorithms. We used the default Super Learner algorithms implemented in the base installation of the tmle R package v.1.2.0-5 (Gruber S. and van der Laan M., 2017), which included the following: i) stepwise selection, ii) generalized linear modelling (GLM), iii) a GLM variant that includes second-order polynomials and two-by-two interactions of the main terms included in the model. Additionally, eltmle users will have the option to include Bayes generalized linear models and generalized additive models as additional Super Learner algorithms. Future implementations will offer more advanced machine learning algorithms.


Estimating mixture models for environmental noise assessment

Gordon Hughes
School of Economics, University of Edinburgh

Environmental noise – linked to traffic, industrial activities, wind farms, etc. – is a matter of increasing concern as its association with sleep deprivation and a variety of health conditions has been studied in more detail. The framework used for noise assessments assumes that there is a basic level of background noise, which will often vary with time of day and spatially across monitoring locations, plus additional noise components from random sources such as vehicles, machinery or wind affecting trees. The question that has to be investigated is whether, and by how much, the noise at each location will be increased by the addition of one or more new sources of noise such as a road, a factory or a wind farm.

The paper adopts a mixtures specification to identify heterogeneity in the sources and levels of background noise. In particular, it is important to distinguish between sources of background noise that may be associated with covariates of noise from a new source and other sources that are independent of these covariates. A further consideration is that noise levels are not additive, though sound pressures are.
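
The non-additivity of noise levels can be made concrete with a small Python sketch. It assumes independent (incoherent) sources, for which the quantities that add are the energy-like terms 10^(L/10), proportional to mean-square sound pressure, rather than the decibel levels themselves. The function name combine_levels and the 40 dB figures are illustrative only.

```python
import math

def combine_levels(levels_db):
    """Combine sound levels (dB) from incoherent sources by summing energies."""
    total_energy = sum(10 ** (level / 10) for level in levels_db)
    return 10 * math.log10(total_energy)

background, new_source = 40.0, 40.0
print(round(combine_levels([background, new_source]), 2))  # 43.01, not 80
```

A new source as loud as the existing background therefore raises the level by about 3 dB, while a source 20 dB quieter than the background adds well under 0.1 dB, which is why the composition of the background matters for the assessment.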

The analysis uses an extended version of Partha Deb’s Stata command (fmm) for estimating finite mixture models. The extended command allows for the imposition of restrictions such as that not all components are affected by the covariates or that the probabilities that particular components are observed depend upon exogenous factors. These extensions allow for a richer specification of the determinants of observed noise levels. The extended command is supplemented by post-estimation commands which use Monte Carlo methods to estimate how a new source will affect the noise exposure at different locations and how outcomes may be affected by noise control measures. The goal is to produce results that can be understood by decision-makers with little or no statistical background.


Sequential (two-stage) estimation of linear panel data models

Sebastian Kripfganz
University of Exeter Business School

I present the new Stata command xtseqreg that implements sequential (two-stage) estimators for linear panel data models. In general, the conventional standard errors are no longer valid in sequential estimation when the residuals from the first stage are regressed on another set of (often time-invariant) explanatory variables at a second stage. xtseqreg computes the analytical standard-error correction of Kripfganz and Schwarz (ECB Working Paper 1838, 2015) that accounts for the first-stage estimation error. The command can be used to fit both stages of a sequential regression or either stage separately. OLS and 2SLS estimation are supported, as well as one-step and two-step "difference"-GMM and "system"-GMM estimation with flexible choice of the instruments and weighting matrix. Available postestimation statistics include the Arellano-Bond test for absence of autocorrelation in the first-differenced errors and Hansen's J-test for the validity of the overidentifying restrictions. While it is not intended to introduce xtseqreg as a competitor for existing commands, it can mimic part of their behaviour. In particular, xtseqreg can replicate results obtained with xtdpd and xtabond2. In that regard, I will illustrate some common pitfalls in the estimation of dynamic panel models.


Response surface models for the Elliott, Rothenberg, Stock DF-GLS unit root test

Kit Baum
Boston College
Jesús Otero
Universidad del Rosario, Bogotá

We present response surface coefficients for a large range of quantiles of the Elliott, Rothenberg and Stock (Econometrica, 1996) DF-GLS unit root tests, for different combinations of the number of observations and the lag order in the test regressions, where the latter can be either specified by the user or endogenously determined. The critical values depend on the method used to select the number of lags. The Stata command ersur is presented, and its use illustrated with an empirical example that tests the validity of the expectations hypothesis of the term structure of interest rates. The command will be of interest to applied econometricians. More generally, it illustrates how a rather cumbersome lookup table can be handled in the context of a Stata package.


kmatch: Kernel matching with automatic bandwidth selection

Ben Jann
University of Bern

In this talk I will present a new matching command for Stata called kmatch. The command matches treated and untreated observations with respect to covariates and, if outcome variables are provided, estimates treatment effects based on the matched observations, optionally including regression-adjustment bias correction. Multivariate (Mahalanobis) distance matching and propensity score matching are supported, using either kernel matching, ridge matching, or nearest-neighbor matching. For kernel and ridge matching, several methods for data-driven bandwidth selection, such as cross-validation, are offered. The package also includes various commands for evaluating balancing and common-support violations. A focus of the talk will be on how kernel and ridge matching with automatic bandwidth selection compare to nearest-neighbor matching.
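
To give a feel for kernel matching, the Python sketch below estimates an average treatment effect on the treated by comparing each treated outcome with a kernel-weighted average of control outcomes on a scalar score. It is a generic textbook version with a fixed bandwidth and an Epanechnikov kernel, both chosen for this example; the function names are hypothetical and it is not a description of how kmatch is implemented.

```python
def epanechnikov(u):
    """Epanechnikov kernel, zero outside |u| < 1."""
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def kernel_att(treated, controls, bandwidth):
    """ATT from kernel matching on a scalar score.

    treated, controls: lists of (score, outcome) pairs.
    Treated units with no control inside the bandwidth are dropped,
    a simple common-support rule adopted for this sketch.
    """
    effects = []
    for score_t, y_t in treated:
        weights = [epanechnikov((score_c - score_t) / bandwidth)
                   for score_c, _ in controls]
        total = sum(weights)
        if total == 0:
            continue  # no comparable control: outside common support
        y0_hat = sum(w * y_c for w, (_, y_c) in zip(weights, controls)) / total
        effects.append(y_t - y0_hat)
    if not effects:
        raise ValueError("no treated unit has a control within the bandwidth")
    return sum(effects) / len(effects)

# Hypothetical data: (propensity score, outcome)
treated = [(0.5, 5.0)]
controls = [(0.4, 1.0), (0.6, 3.0), (0.9, 10.0)]
print(kernel_att(treated, controls, bandwidth=0.2))
```

The bandwidth controls the trade-off: a small value uses only very close controls (low bias, high variance), a large value averages over many, which is why data-driven selection is attractive.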


Estimation and inference for quantiles and indices of inequality and poverty with survey data: leveraging built-in support for complex survey design and multiply imputed data

Philippe Van Kerm
Luxembourg Institute for Social and Economic Research

Stata is the software of choice for many analysts of household surveys, in particular for poverty and inequality analysis. No dedicated suite of commands comes bundled with the software, but many user-written commands are freely available for the estimation of various types of indices. This talk will present a set of new tools which complement and significantly upgrade some existing packages. The key feature of the new packages is their ability to leverage Stata's built-in capacity for dealing with survey design features (via the svy prefix), resampling methods (via the bootstrap, jackknife or permute prefixes), multiply imputed data (via mi) and various post-estimation commands for testing purposes. The talk will review basic indices, outline estimation and inference for such non-linear statistics with survey data, show programming tips, and illustrate various uses of the new commands.


Fitting Bayesian regression models using the bayes prefix

Yulia Marchenko
StataCorp, College Station, TX

Stata 15 introduces the new bayes prefix for fitting Bayesian regression models more easily. It combines Bayesian features with Stata's intuitive and elegant specification of regression models. For example, you fit classical linear regression by using

. regress y x1 x2

You can now fit Bayesian linear regression by using

. bayes: regress y x1 x2

In addition to normal linear regression, the bayes prefix supports over 50 likelihood models including models for continuous, binary, ordinal, categorical, count, censored, survival outcomes, and more. All of Stata's Bayesian features are supported with the bayes prefix. In my presentation, I will demonstrate how to use the new bayes prefix to fit a variety of Bayesian regression models including survival and sample-selection models.


mrrobust: a Stata package for MR-Egger regression type analyses

Tom Palmer
Department of Mathematics and Statistics, Lancaster University

MR-Egger regression analyses are becoming increasingly common in Mendelian randomization (MR) studies (Bowden et al. 2015). MR-Egger analyses use summary-level data, as reported by genome-wide association studies. Such data are conveniently available from the MR-Base platform (Hemani et al., 2016).

MR-Egger and related methods treat a multiple-instrument MR analysis as a meta-analysis across the multiple genotypes. In the MR-Egger approach, bias from the pleiotropic effects of the multiple genotypes is treated like small-study reporting bias in a meta-analysis. These methods provide an important quality-control check for any MR analysis incorporating multiple genotypes.

We implemented several of these methods (inverse-variance weighted [IVW], MR-Egger and weighted median approaches, as well as a relevant plot) in a package for Stata called mrrobust (pleiotropy robust methods for MR). There are also implementations of these methods in R (Yavorska and Burgess 2016).

mrrobust is freely available from https://github.com/remlapmot/mrrobust, which includes instructions on how to install the package from within Stata. We plan to add features over time.
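
For readers new to these estimators, the Python sketch below shows the point estimates only, under common textbook assumptions: the genotype-outcome estimates by are regressed on the genotype-exposure estimates bx with weights 1/se_by^2, through the origin for IVW and with an intercept for MR-Egger, and the bx are oriented to be positive. The function names and the four-genotype data are invented for illustration; this is not the mrrobust code and it omits standard errors.

```python
def ivw(bx, by, se_by):
    """IVW estimate: weighted regression of by on bx through the origin."""
    w = [1 / s ** 2 for s in se_by]
    num = sum(wi * x * y for wi, x, y in zip(w, bx, by))
    den = sum(wi * x * x for wi, x in zip(w, bx))
    return num / den

def mr_egger(bx, by, se_by):
    """MR-Egger: the same weighted regression with an intercept.

    Returns (intercept, slope); a non-zero intercept suggests
    directional pleiotropy.
    """
    w = [1 / s ** 2 for s in se_by]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, bx)) / sw
    my = sum(wi * y for wi, y in zip(w, by)) / sw
    slope = (sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, bx, by))
             / sum(wi * (x - mx) ** 2 for wi, x in zip(w, bx)))
    return my - slope * mx, slope

# Hypothetical summary data: causal effect 0.5 plus a pleiotropic shift of 0.05.
bx = [0.1, 0.2, 0.3, 0.4]
by = [0.05 + 0.5 * b for b in bx]
se_by = [0.02, 0.03, 0.02, 0.04]

print(ivw(bx, by, se_by))       # pulled away from 0.5 by the pleiotropic shift
print(mr_egger(bx, by, se_by))  # intercept near 0.05, slope near 0.5
```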


Bowden J, Davey Smith G, Burgess S. Mendelian randomization with invalid instruments: effect estimation and bias detection through Egger regression. International Journal of Epidemiology, 2015, 44, 2, 512–525.

Hemani G, Zheng J, Wade KH, et al., Davey Smith G, Gaunt TR, Haycock PC. The MR-Base Collaboration. MR-Base: a platform for systematic causal inference across the phenome using billions of genetic associations. bioRxiv, 2016, doi: https://doi.org/10.1101/078972; http://www.mrbase.org/.

Yavorska O, Burgess S. MendelianRandomization: Mendelian Randomization Package. 2016, version 0.2.0. https://CRAN.R-project.org/package=MendelianRandomization


Group sequential clinical trial designs for normally distributed outcome variables

Michael Grayling
James Wason
Adrian Mander

MRC Biostatistics Unit, University of Cambridge

In a group sequential trial, accumulated data are analysed at numerous time-points in order to allow early decisions to be made about a hypothesis of interest. These designs have historically been recommended for their ethical, administrative and economic benefits, and indeed have a long history of use in clinical research.

In this presentation, we begin by discussing the theory behind these designs. Then, we describe a collection of new Stata commands for computing the stopping boundaries and required group size of various classical group sequential designs, assuming a normally distributed outcome variable. Following this, we demonstrate how the performance of several designs can be compared graphically. We conclude by discussing the many possible future extensions of this work.


piecewise_ginireg – A Stata package to run piecewise Gini regressions

Jan Ditzen
Centre for Energy Economics Policy and Research, Heriot-Watt University
Shlomo Yitzhaki
The Hebrew University and Hadassah Academic College, Israel

This presentation introduces piecewise_ginireg, an extension to Mark Schaffer’s “ginireg” command. Gini regressions are based on Gini’s mean difference as a measure of dispersion, and the estimator can be interpreted as a weighted average of slopes. (See for example Olkin and Yitzhaki, “Gini Regression Analysis”, International Statistical Review 60 (1992): 185–196.) In comparison to a simple OLS regression, the covariance is replaced by the Gini covariance. In comparison to a regular Gini regression, piecewise_ginireg splits the dataset into subsets and runs a Gini regression on each of them, which makes it possible to obtain estimated coefficients for each subset and to test whether the linearity assumption holds in the data. As a first step, piecewise_ginireg runs a normal Gini regression on the entire dataset (iteration 0). The estimated coefficients are saved, the residuals computed and the LMA curve calculated. The LMA shows how the Gini covariance is composed. In the following iterations the dataset is split into separate sections defined by a rule. piecewise_ginireg allows different rules, such as the minimum or maximum of the LMA, or the point where the LMA crosses the origin. On each of the sections a Gini regression is performed, where the dependent variable is the error term of the preceding iteration. After each iteration the coefficients are saved, and the residuals and the LMA are calculated. piecewise_ginireg allows the user to specify the maximum number of iterations in several ways: either as a fixed number, or by iterating until the normality conditions of the error terms hold.
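
The idea of replacing the covariance by the Gini covariance can be sketched in Python for a single regressor. The sketch assumes the semi-parametric Gini slope cov(Y, F(X)) / cov(X, F(X)), with the distribution function F estimated by ranks; the helper names are made up, and it shows only a plain Gini regression slope, not the piecewise procedure or the LMA curve.

```python
def average_ranks(values):
    """Ranks 1..n, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def cov(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((p - ma) * (q - mb) for p, q in zip(a, b)) / len(a)

def gini_slope(y, x):
    """Gini regression slope: cov(y, F(x)) / cov(x, F(x)).

    Ranks stand in for F(x); dividing them by n would rescale the
    numerator and denominator equally, so it is skipped.
    """
    r = average_ranks(x)
    return cov(y, r) / cov(x, r)

x = [1.0, 4.0, 2.0, 8.0, 5.0]
y = [2 * v + 1 for v in x]
print(gini_slope(y, x))  # close to 2 for an exactly linear relationship
```

Because only the ranks of x enter the weights, extreme x-values have less influence on the slope than they do in OLS.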


Three serial correlation tests for panel data regression models

Jesse Wursten
KU Leuven

The default method to calculate standard errors in regression models requires idiosyncratic errors (uncorrelated on any dimension). More general methods exist (e.g. HAC and clustered errors) but are not always feasible, especially in smaller datasets or those with a complicated (correlation) structure. However, if your residuals are uncorrelated, the default standard errors might actually suffice and be more reliable than their cluster robust version. In this presentation, I present three new panel serial correlation tests which can be used to look for correlation along the first dimension (‘within’ groups). Likewise, I present two new(-ish) commands to test for correlation in the second dimension (‘between’ groups). These commands are faster, more versatile and robust than existing ones (e.g. xtserial, abar).


Local maxima in the estimation of the ZINB and sample selection models

João M.C. Santos Silva
University of Surrey

It is well known that likelihood functions may have multiple (local) maxima. Unfortunately, the algorithms Stata uses to estimate some popular non-linear models can converge to local maxima of the likelihood function, and in these cases the results obtained are meaningless. This is a serious problem and users do not seem to be aware of it. In this presentation I use both the heckman and zinb commands to illustrate this problem. As an aside, I also note that Stata uses the incorrect version of Vuong’s test to compare zero-inflated models with their standard counterparts.


Causal inference with sample selection

David Drukker
StataCorp, College Station, TX

I discuss how to use Stata 15’s extended regression model (ERM) commands to estimate average causal effects when the outcome is censored or when the sample is endogenously selected. I also discuss how to use these commands to estimate causal effects in the presence of endogenous explanatory variables, which these commands also accommodate.


Extending Stata graphics via SVG manipulation commands

Robert Grant
Tim Morris
MRC Clinical Trials Unit, University College London

Among the many new features in Stata version 14, arguably the most exciting was completely unheralded: Stata graphs could now be exported to SVG (Scalable Vector Graphics) format. SVG is a great option for storing graphical outputs because it is compact but images can be enlarged without becoming blurred or pixelated. It is also relatively human-readable, as the .svg files are plain text XML code. As output by Stata, this is particularly readable. We present three new commands as the start of a larger SVG manipulation package, which amend an existing Stata-generated .svg file to add features that are not available within Stata. Individual objects such as markers or lines can be made semi-transparent, the SVG can be embedded within a web page with some interactivity such as popup information, and a scatterplot can be converted to a hexagonal bin (2-dimensional histogram).
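
Because an .svg file is plain XML, this kind of post-processing needs very little machinery. The Python sketch below makes markers semi-transparent by setting a fill-opacity attribute on every circle element. It assumes that markers are drawn as circle elements; the tiny SVG string is a hand-written stand-in rather than real Stata output, and the function name is hypothetical and unrelated to the commands presented in the talk.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
ET.register_namespace("", SVG_NS)  # keep the default namespace on output

def make_markers_transparent(svg_text, opacity=0.5):
    """Return svg_text with fill-opacity set on every <circle> element."""
    root = ET.fromstring(svg_text)
    for circle in root.iter(f"{{{SVG_NS}}}circle"):
        circle.set("fill-opacity", str(opacity))
    return ET.tostring(root, encoding="unicode")

# Hypothetical two-marker plot, not actual Stata output.
svg = (
    f'<svg xmlns="{SVG_NS}" width="100" height="100">'
    '<circle cx="30" cy="40" r="5" fill="navy"/>'
    '<circle cx="60" cy="70" r="5" fill="navy"/>'
    "</svg>"
)
print(make_markers_transparent(svg))
```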


Robust statistics in Stata

Ben Jann
University of Bern
Vincenzo Verardi
University of Namur, Free University of Brussels, and FNRS

Statistical methods often rely on restrictive assumptions that are expected to be (approximately) true in real life situations. For example, many classical statistical models ranging from descriptive statistics to regression models and/or multivariate analysis are based on the assertion that data are normally distributed.

The main justification for assuming a normal distribution is that it generally approximates well many real life situations and, more conveniently, allows the derivation of explicit formulas for optimal statistical methods such as maximum likelihood estimators. However, the normality assumption may be violated in practice and results obtained via classical estimations may be uninformative or misleading. For example, it can happen that the vast majority of the observations are approximately normally distributed as assumed but a small cluster of so-called outliers is generated from a different distribution.

In this situation, classical estimation techniques may break down and not convey the desired information. To deal with such limitations, robust statistical techniques have been developed. In this talk we will give a brief overview of situations in which robust methods should be used. We will start by describing "theoretically" such techniques in descriptive analysis, regression models, and multivariate statistics. We will then present some robust packages that have been implemented to make these estimators available (and fast to compute) in Stata.

This talk is related to a forthcoming Stata Press book we are writing.


Report to users, wishes and grumbles

William Gould & colleagues
StataCorp, College Station, TX

StataCorp representatives will be given the floor, aiming to report on recent developments at StataCorp, and to discuss wishes, grumbles, and suggestions for further development with users.

