The **17th London Stata Users Group Meeting** took place on **15-16 September 2011** at **Cass Business School, London, UK**.

The **Stata Users Group Meeting** is a two-day international conference where the use of Stata is discussed across a wide range of fields and environments. Established in 1995, the UK meeting is the longest-running series of Stata Users Group Meetings. The meeting is open to everyone. In past years, participants have travelled from around the world to attend the event. Representatives from StataCorp are also in attendance.

View the abstracts and download the presentations for the 17th London Stata Users Group Meeting below:

Roger B. Newson

National Heart and Lung Institute, Imperial College London

Splines, including polynomials, are traditionally used to model nonlinear relationships involving continuous predictors. However, when they are included in linear models (or generalized linear models), the estimated parameters for polynomials are not easy for nonmathematicians to understand, and the estimated parameters for other splines are often not easy even for mathematicians to understand. It would be easier if the parameters were differences or ratios between the values of the spline at the reference points and the value of the spline at a base reference point or if the parameters were values of the polynomial or spline at reference points on the
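The reparameterization the abstract describes can be sketched outside Stata; below is a minimal Python illustration using a simple quadratic in place of a general spline. The data, reference ages, and base age are all invented for the example.

```python
import numpy as np

# Illustrative data: the outcome is an exact quadratic in age, so the fit is exact.
age = np.arange(20, 61, 5, dtype=float)
y = 0.01 * age**2 - 0.5 * age + 30.0

# Fit a quadratic (the simplest polynomial "spline", with no interior knots).
coefs = np.polyfit(age, y, deg=2)

# Instead of reporting the raw coefficients, report the fitted curve at
# reference points, and its differences from a base reference point.
refs = np.array([25.0, 40.0, 55.0])  # hypothetical reference ages
base = 40.0                          # hypothetical base reference age
fitted_at_refs = np.polyval(coefs, refs)
diffs_from_base = fitted_at_refs - np.polyval(coefs, base)

for r, v, d in zip(refs, fitted_at_refs, diffs_from_base):
    print(f"age {r:4.0f}: fitted {v:6.2f}, difference from base {d:+6.2f}")
```

Reporting fitted values at reference ages, and their differences from a base age, conveys the same curve as the raw polynomial coefficients but in units a nonmathematician can read directly.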

UK11_newson.pdf

UK11_newson_dofiles1.zip

Robert Grant

Kingston University and St. George’s University of London

Background: Random effects are commonly modeled in multilevel, longitudinal, and latent-variable settings. Rather than estimating fixed effects for specific clusters of data, “predictions” can be made as the mode or mean of posterior distributions that arise as the product of the random effect (an empirical Bayes prior) and the likelihood function conditional on cluster membership.

Analyses and data: This presentation will explore the experiences and lessons learned in using the bootstrap for inference on random-effects predictors following logistic regression models conducted through both

Results and considerations: Multilevel modeling and prediction are both computer-intensive, and so bootstrapping them is especially time-consuming. Examples from do-files with some helpful approaches will be shown. A small proportion of modal best linear unbiased predictors contained errors, possibly arising from the prediction algorithm. Various bootstrap confidence intervals exhibited problems such as excluding the point prediction and degeneracy. Methods for tracing the source will be presented.

Conclusion: Bootstrapping provides flexible but time-consuming inference for individual clusters’ predictions. However, there are potential problems that analysts should be aware of.
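The generic percentile-bootstrap machinery behind such intervals can be sketched as follows. This Python toy bootstraps a simple mean rather than a multilevel model's random-effects predictions, and the data are simulated, not from the talk.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cluster-level data; the talk bootstraps random-effects
# predictions after multilevel logistic models -- here a plain mean
# stands in, to show the generic percentile-interval machinery.
data = rng.normal(loc=1.0, scale=2.0, size=200)

B = 2000
boot_stats = np.empty(B)
for b in range(B):
    resample = rng.choice(data, size=data.size, replace=True)
    boot_stats[b] = resample.mean()

lo, hi = np.percentile(boot_stats, [2.5, 97.5])
point = data.mean()
print(f"point {point:.3f}, 95% percentile CI ({lo:.3f}, {hi:.3f})")

# One of the pathologies the abstract mentions is easy to check for:
ci_excludes_point = not (lo <= point <= hi)
print("CI excludes point prediction:", ci_excludes_point)
```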

UK11_Grant.ppt

Ian White

MRC Biostatistics Unit, Cambridge

Any analysis with incomplete data makes untestable assumptions about the missing data, and analysts are therefore urged to conduct sensitivity analyses. Ideally, a model is constructed containing a nonidentifiable parameter

UK11_White.pdf

Adrian Mander

MRC Biostatistics Unit Hub for Trials Methodology, Cambridge

One of the aims of a phase I trial in oncology is to find the maximum tolerated dose. A set of doses is administered to participants, starting from the lowest dose and increasing in steps. To do this safely, the toxicity of each dose is assessed, and a decision is made about whether to proceed to the next highest dose until the desired target toxicity level is found. A suitable dose is then chosen to take forward into phase II studies to discover whether the drug is efficacious. The majority of oncology phase I trials use algorithm-based rules such as the 3 + 3 design to escalate doses; the 3 + 3 design is easy for nonstatisticians to implement but is statistically inefficient. Other designs, such as the continual reassessment method (CRM; O’Quigley, Pepe, and Fisher 1990), use a model to help guide the decision of which dose to give. The complexity of the CRM, and the fact that it requires software, may be reasons why it is not more widely used. This talk will describe a new command
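The 3 + 3 escalation rule the abstract contrasts with the CRM is simple enough to state in a few lines; here is a Python sketch, with an invented trial path for illustration.

```python
def three_plus_three(dlt_counts):
    """Run the standard 3 + 3 escalation rule.

    dlt_counts[d] = (DLTs in the first cohort of 3, DLTs in the second
    cohort of 3) at dose level d.  Returns the index of the maximum
    tolerated dose, or -1 if even the lowest dose is too toxic.
    """
    mtd = -1
    for d, (first, second) in enumerate(dlt_counts):
        if first == 0:
            mtd = d                      # 0/3 DLTs: tolerated, escalate
            continue
        if first == 1 and first + second <= 1:
            mtd = d                      # 1/6 DLTs: tolerated, escalate
            continue
        return mtd                       # >= 2 DLTs: stop; MTD is previous dose
    return mtd

# Hypothetical trial: dose 0 is clean, dose 1 needs an expanded cohort,
# dose 2 shows 2/3 DLTs, so the MTD is dose level 1.
path = [(0, 0), (1, 0), (2, 0)]
print("MTD is dose level", three_plus_three(path))
```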

UK11_Mander.pdf

Arne Risa Hole

University of Sheffield

Joint with Andy Dickerson and Luke Munford

It is well-known that the dummy variable estimator for the fixed-effects ordered logit model is inconsistent when

UK11_Hole.pdf

Tom Palmer

MRC CAiTE Centre, School of Social and Community Medicine, University of Bristol

Joint with Roger Harbord, Paul Clarke, and Frank Windmeijer

In this talk we describe how to fit structural mean models (SMMs), as proposed by Robins, using instrumental variables in the generalized method of moments (GMM) framework using Stata’s

UK11_palmer_handouts.pdf

UK11_palmer_presentation.pdf

Michael J. Crowther

Department of Health Sciences, University of Leicester

Joint with Keith R. Abrams and Paul C. Lambert

The joint modeling of longitudinal and time-to-event data has exploded in the methodological literature in the past decade; however, the availability of software to implement the methods lags behind. The most common form of joint model assumes that the association between the survival and longitudinal processes is driven by shared random effects. As a result, computationally intensive numerical integration techniques such as Gauss–Hermite quadrature are required to evaluate the likelihood. We describe a new user-written command
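Gauss–Hermite quadrature, the integration technique mentioned above, replaces an intractable expectation over a normal random effect with a weighted sum of function evaluations; a minimal Python sketch, checked against two known normal moments:

```python
import numpy as np

# Gauss-Hermite quadrature approximates integrals of the form
#   int f(x) exp(-x^2) dx  ~=  sum_i w_i f(x_i),
# which, after a change of variable, evaluates expectations over a
# normal random effect -- the kind of integral that arises in the
# joint-model likelihood described above.
nodes, weights = np.polynomial.hermite.hermgauss(15)

def normal_expectation(g, mu=0.0, sigma=1.0):
    """E[g(Z)] for Z ~ N(mu, sigma^2), via z = mu + sqrt(2)*sigma*x."""
    z = mu + np.sqrt(2.0) * sigma * nodes
    return np.sum(weights * g(z)) / np.sqrt(np.pi)

# Sanity checks against known moments of the standard normal:
print(normal_expectation(lambda z: z**2))   # about 1 (the variance)
print(normal_expectation(np.exp))           # about e^{1/2} (lognormal mean)
```

With 15 nodes the rule is exact for polynomial integrands up to degree 29, which is why so few evaluations suffice in practice.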

UK11_crowther.pdf

Michael Wallace

London School of Hygiene and Tropical Medicine

Measurement error in exposure variables can lead to bias in effect estimates, and methods that aim to correct this bias often come at the price of greater standard errors (and so, lower statistical power). This means that standard sample size calculations are inadequate and that, in general, simulation studies are required. Our routine
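One reason such corrections cost power is the classical attenuation of regression slopes by measurement error. A small Python simulation, with invented variances, illustrates the effect that sample-size calculations must account for:

```python
import numpy as np

rng = np.random.default_rng(1)

# Classical measurement error: we observe W = X + U instead of X.
# The slope from regressing Y on W is attenuated by the reliability
# ratio lam = var(X) / (var(X) + var(U)) -- one reason power
# calculations need adjusting, often by simulation as in the talk.
n = 200_000
beta = 0.8
sigma_x, sigma_u = 1.0, 0.5

x = rng.normal(0.0, sigma_x, n)
y = beta * x + rng.normal(0.0, 1.0, n)
w = x + rng.normal(0.0, sigma_u, n)

slope_true = np.polyfit(x, y, 1)[0]
slope_err = np.polyfit(w, y, 1)[0]
lam = sigma_x**2 / (sigma_x**2 + sigma_u**2)   # theoretical attenuation = 0.8

print(f"slope using X: {slope_true:.3f}")      # close to 0.8
print(f"slope using W: {slope_err:.3f}")       # close to 0.8 * 0.8 = 0.64
```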

UK11_wallace.ppt

David Boniface

Epidemiology and Public Health, University College London

Aim: To create a web-based facility for customers to enter the address of a house and obtain, within milliseconds, a graph showing the trend in the price of the house since it was last sold, extrapolated to the current date.

Method: The UK Land Registry of house sale prices was used to estimate mean price trends from 2000 to 2010 for each category of house. The Stata ado-file

Challenges: use of coded dates, choice of user-specified knots for the splines, saving and retrieving the knots and parameter estimates, use of a log scale for prices to deal with the skewed price distribution, estimation of prediction intervals, and the 2009 slump in house prices.

UK11_boniface.ppt

Alfonso Miranda

Institute of Education, University of London

Joint with Massimiliano Bratti

We propose a maximum simulated likelihood estimator for models in which an endogenous dichotomous treatment affects a count outcome in the presence of either sample selection or endogenous participation. We allow the treatment to affect both the participation or sample-selection rule and the main outcome. Applications of this model are frequent in—but not limited to—health economics. We show an application of the model using data from Kenkel (Kenkel and Terza, 2001,

UK11_Miranda.pdf

Jin Hyuk Lee

Texas A&M Health Science Center

Joint with John Huber Jr.

Multiple imputation (MI) is known as an effective method for handling missing data. However, it is not clear whether the method remains effective when the data contain a high percentage of missing observations on a variable. This study examines the effectiveness of MI in data with 10% to 80% missing observations, using the absolute bias and root mean squared error of MI measured under missing completely at random, missing at random, and not missing at random assumptions. Using both simulated data drawn from a multivariate normal distribution and example data from the Predictive Study of Coronary Heart Disease, the bias and root mean squared error using MI are much smaller than those obtained when complete-case analysis is used. In addition, the bias of MI is consistent as the number of imputations (M) increases from M = 10 to M = 50. Moreover, compared with the regression method and the predictive mean matching method, the Markov chain Monte Carlo method can also be used as an imputation mechanism for continuous and univariate missing variables. In conclusion, MI produces less-biased estimates, but when large proportions of data are missing, other things need to be considered for proper imputation, such as the number of imputations, the imputation mechanism, and the missing-data mechanism.
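Whatever the imputation mechanism, the M per-imputation results are combined by Rubin's rules; a minimal Python sketch, with invented estimates and variances:

```python
import numpy as np

def rubin_pool(estimates, variances):
    """Combine M multiply-imputed estimates by Rubin's rules.

    Returns the pooled estimate and its total variance
    T = W + (1 + 1/M) * B, where W is the within-imputation variance
    and B the between-imputation variance.
    """
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = q.size
    qbar = q.mean()
    w = u.mean()                      # within-imputation variance
    b = q.var(ddof=1)                 # between-imputation variance
    t = w + (1.0 + 1.0 / m) * b       # total variance
    return qbar, t

# Hypothetical results from M = 5 imputed datasets:
est, tot_var = rubin_pool([1.1, 0.9, 1.0, 1.2, 0.8],
                          [0.04, 0.05, 0.04, 0.05, 0.04])
print(f"pooled estimate {est:.3f}, total variance {tot_var:.4f}")
```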

UK11_lee.pptx

Irene Petersen

University College London

Joint with Catherine Welch, Jonathan Bartlett, Ian White, Richard Morris, Louise Marston, Kate Walters, Irwin Nazareth, and James Carpenter

Multiple imputation is increasingly regarded as the standard method to account for partially observed data, but most methods have been based on cross-sectional imputation algorithms. Recently, a new multiple-imputation method, the twofold fully conditional specification (FCS) method, was developed to impute missing data in longitudinal datasets with nonmonotone missing data. (See Nevalainen, J., M. G. Kenward, and S. M. Virtanen. 2009. Missing values in longitudinal dietary data: A multiple imputation approach based on a fully conditional specification.

UK11_welch.pptx

Gordon Hughes

School of Economics, University of Edinburgh

Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behavior, but it may also reflect concern about the role of omitted variables that are or may be spatially correlated. The classic models of spatial autocorrelation or spatial error rely upon a predefined matrix of spatial weights
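Such a spatial weights matrix is typically row standardized before use; a minimal Python sketch for four units on a line (an invented example):

```python
import numpy as np

# Row-standardized contiguity weights for four units on a line
# (1-2-3-4): w_ij = 1 if i and j are neighbours, then each row is
# scaled to sum to one, as the classic spatial-error and spatial-lag
# specifications assume.
contiguity = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

W = contiguity / contiguity.sum(axis=1, keepdims=True)

# The spatial lag of a variable x is then simply W @ x: each unit
# receives the average of its neighbours' values.
x = np.array([10.0, 20.0, 30.0, 40.0])
print(W @ x)   # [20. 20. 30. 30.]
```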

At present, Stata’s spatial procedures include a range of user-written routines that are designed to deal with cross-sectional spatial data. The recent release of a set of programs (including

The theoretical literature on econometric models for the analysis of spatial panels has flourished in the last decade with notable contributions from LeSage and Pace, Elhorst, and Pfaffermayr, among others. In some cases, authors have made available specific code for the implementation of the techniques that they have developed. However, the programming language of choice for such methods has been MATLAB, which is expensive and has a fairly steep learning curve for nonusers. Many of the procedures assume that there are no missing data, and they may not be able to handle large datasets because the model specifications can easily become unmanageable if either

The presentation will cover a set of user-written maximum likelihood procedures for fitting models with a variety of spatial structures including the spatial error model, the spatial Durbin model, the spatial autocorrelation model, and certain combinations of these models—the terminology is attributable to LeSage and Pace (2009). A suite of MATLAB programs to fit these models for both random and fixed effects has been compiled by Elhorst (2010) and provides the basis for the implementation in Stata/Mata. Methods of dealing with missing data, including the implementation of an approach proposed by Pfaffermayr (2009), will be discussed. The problem of missing data is most severe when data on the dependent variable are missing in the spatial autocorrelation model because it means that information on spatial interactions may be greatly reduced by the exclusion of countries or other panel units. In such cases, some form of imputation may be essential, so the presentation will consider alternative methods of imputation. It should be noted that

A second aspect of spatial panel models that will be covered in the presentation concerns the links between such models and random-coefficient models that can be fit using procedures such as

The user-written procedures introduced in the presentation will be illustrated by applications drawn from analyses of demand for infrastructure, health outcomes, and climate for cross-country data covering the developing and developed world plus regions in China.

UK11_hughes.pdf

Vince Wiggins

StataCorp LP

We will discuss SEM (structural equation modeling), not from the perspective of the models for which it is most often used—measurement models, confirmatory factor analysis, and the like—but from the perspective of how it can extend other estimators. From a wide range of choices, we will focus on extensions of mixed models (random and fixed-effects regression). Extensions include conditional effects (not completely random), endogenous covariates, and others.

UK11_Wiggins.pdf

Yulia Marchenko

StataCorp LP

I present the new Stata 12 command,

UK11_marchenko.pdf

Joachim De Weerdt

Economic Development Initiatives, Tanzania

Researchers typically spend significant amounts of time cleaning and labeling data files in preparation for analyses of survey data. Computer-assisted personal interviewing (CAPI) makes it possible to automate this process. First, consistency checks can be run during the interview so that only data that pass autogenerated and user-written validation tests come back from the field. Second, CAPI allows for the autogeneration of a Stata do-file that labels data files. This presentation discusses the Stata export procedure used by

UK11_deweerdt.pdf

J. Charles Huber Jr.

Texas A&M University

Joint with Michael Hallman, Victoria Friedel, Melissa Richard, and Huandong Sun

Modern genetic genome-wide association studies typically rely on single nucleotide polymorphism (SNP) chip technology to determine hundreds of thousands of genotypes for an individual sample. Once these genotypes are ascertained, each SNP alone or in combination is tested for association with outcomes of interest such as disease status or severity. Project Heartbeat! was a longitudinal study conducted in the 1990s that explored changes in lipids and hormones and morphological changes in children from 8 to 18 years of age. A genome-wide association study is currently being conducted to look for SNPs that are associated with these developmental changes. While there are specialty programs available for the analysis of hundreds of thousands of SNPs, they are not capable of modeling longitudinal data. Stata is well equipped for modeling longitudinal data but cannot load hundreds of thousands of variables into memory simultaneously. This talk will briefly describe the use of Mata to import hundreds of thousands of SNPs from the Illumina SNP chip platform and how to load those data into Stata for longitudinal modeling.

UK11_Huber.pptx

Adam Jacobs

Dianthus Medical Limited

The Clinical Data Interchange Standards Consortium (CDISC) is a globally relevant nonprofit organization that defines standards for handling data in clinical research. It produces a range of standards for clinical data at various stages of maturity. One of the most mature standards is the Study Data Tabulation Model, which provides a standardized yet flexible data structure for storing entire databases from clinical trials. A related standard is the Analysis Dataset Model, which defines datasets that can be used for analyzing data from clinical trials. I shall explain how the CDISC standards work, how Stata can simplify many of the routine tasks encountered in handling CDISC datasets, and the great efficiencies that can result from using datasets in a standardized structure.

UK11_jacobs.ppt

Philippe Van Kerm

CEPS/INSTEAD, Luxembourg

This talk presents a simple but effective graphical device for visualization of patterns of income mobility. The device in effect uses color palettes to picture information contained in transition matrices created from a fine partition of the marginal distributions. The talk explains how these graphs can be constructed using the user-written package
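The transition matrix underlying such a picture can be built by partitioning each marginal distribution into quantile groups and counting moves between them; a Python sketch with simulated incomes:

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated incomes in two periods, positively correlated, standing in
# for the panel data a mobility study would use.
n = 10_000
y1 = rng.lognormal(mean=0.0, sigma=0.5, size=n)
y2 = y1 * rng.lognormal(mean=0.0, sigma=0.3, size=n)

# Partition each marginal distribution into deciles and count moves.
k = 10
g1 = (np.argsort(np.argsort(y1)) * k) // n   # decile group in period 1
g2 = (np.argsort(np.argsort(y2)) * k) // n   # decile group in period 2

trans = np.zeros((k, k))
for a, b in zip(g1, g2):
    trans[a, b] += 1
trans /= trans.sum(axis=1, keepdims=True)    # each row: origin decile

# Each row is now a probability distribution over destination deciles;
# mapping a color palette over this matrix gives the kind of picture
# the talk describes.
print(np.round(trans.diagonal(), 2))
```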

UK11_vankerm.pdf

George Leckie

Centre for Multilevel Modelling, University of Bristol

Joint with Chris Charlton

Multilevel analysis is the statistical modeling of hierarchical and nonhierarchical clustered data. These data structures are common in social and medical sciences. Stata provides the

UK11_leckie.do

UK11_leckie.pdf

Ben Jann

University of Bern

Eliciting truthful answers to sensitive questions is an age-old problem in survey research. Respondents tend to underreport socially undesired or illegal behaviors while overreporting socially desirable ones. To combat such response bias, various techniques have been developed that are geared toward providing the respondent greater anonymity and minimizing the respondent’s feelings of jeopardy. Examples of such techniques are the randomized response technique, the item-count technique, and the crosswise model. I will present results from several surveys, conducted among university students, that employ such techniques to measure the prevalence of plagiarism and cheating in exams. User-written Stata programs for analyzing data from such techniques are also presented.
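For the randomized response technique in Warner's classic form, the prevalence estimator follows from one line of algebra; a minimal Python sketch, with an invented design probability and observed proportion:

```python
# Warner's randomized response: with probability p the respondent answers
# the sensitive question, with probability 1 - p its negation.  The
# observed "yes" proportion is  lam = p*pi + (1 - p)*(1 - pi),  so the
# prevalence is recovered as  pi = (lam - (1 - p)) / (2p - 1),  p != 0.5.
def warner_prevalence(lam, p):
    return (lam - (1.0 - p)) / (2.0 * p - 1.0)

# Hypothetical survey: the randomizing device poses the real question
# with p = 0.7, and 41% of respondents answered "yes".
pi_hat = warner_prevalence(0.41, 0.7)
print(f"estimated prevalence: {pi_hat:.3f}")
```

The respondent's answer alone never reveals which question was posed, which is what provides the anonymity these techniques trade on.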

UK11_jann.pdf

Tim Collier

London School of Hygiene and Tropical Medicine

When I first met Stata in October 2000, my golf handicap was 27 and my game was going nowhere slowly. Ten years of intensive Stata therapy later, my handicap is 17.3 and falling. It would, of course, be nonsense to infer from these data that lowering your handicap increases Stata use, but could the reverse be true? Could there be a causal relationship between increasing Stata use and a decreasing handicap? In this presentation, I argue that, yes, there is. Granted, Stata might not work along the lines of traditional golf training aids, but rather its effect is mediated through a third factor, namely time. Golf consumes time. Stata produces time. In this presentation, I will demonstrate how minutes in Stata’s programming world are equivalent to hours in the real world, and by the use of programs within programs, minutes can translate to days. Although extrapolation from an

UK11_Collier.ppt

Nicholas J. Cox

Durham University

Functions in Stata range between those you know you want and those you don’t know you need. The word “functions” is heavily overloaded in Stata; here the focus is on functions in the strict sense, _variables, extended macro functions, and

UK11_Cox_functions.html

UK11_Cox_functions.smcl

Markus Eberhardt

University of Oxford

Stata already has an extensive range of built-in and user-written commands for analyzing

UK11_eberhardt.pdf