## Overview

The **17th London Stata Users Group Meeting** took place on **15-16 September 2011** at **Cass Business School, London, UK**.

The **Stata Users Group Meeting** is a two-day international conference where the use of Stata is discussed across a wide range of fields and environments. Established in 1995, the UK meeting is the longest-running series of Stata Users Group Meetings. The meeting is open to everyone. In past years, participants have travelled from around the world to attend the event. Representatives from StataCorp are also in attendance.

## Proceedings

View the abstracts and download the presentations for the 17th London Stata Users Group Meeting below:

### Sensible parameters for polynomials and other splines

Roger B. Newson

National Heart and Lung Institute, Imperial College London

Splines, including polynomials, are traditionally used to model nonlinear relationships involving continuous predictors. However, when they are included in linear models (or generalized linear models), the estimated parameters for polynomials are not easy for nonmathematicians to understand, and the estimated parameters for other splines are often not easy even for mathematicians to understand. It would be easier if the parameters were values of the polynomial or spline at reference points on the *x*-axis, or differences or ratios between the values of the spline at the reference points and the value of the spline at a base reference point. The **bspline** package can be downloaded from Statistical Software Components, and generates spline bases for inclusion in the design matrices of linear models, based on Schoenberg *B*-splines. The package now has a recently added module **flexcurv**, which inputs a sequence of reference points on the *x*-axis and outputs a spline basis, based on equally spaced knots generated automatically, whose parameters are the values of the spline at the reference points. This spline basis can be modified by excluding the spline vector at a base reference point and including the unit vector. If this is done, then the parameter corresponding to the unit vector will be the value of the spline at the base reference point, and the parameters corresponding to the remaining reference spline vectors will be differences between the values of the spline at the corresponding reference points and the value of the spline at the base reference point. The spline bases are therefore extensions, to continuous factors, of the bases of unit vectors and/or indicator functions used to model discrete factors. It is possible to combine these bases for different continuous and/or discrete factors in the same way, using product bases in a design matrix to estimate factor-value combination means and/or factor-value effects and/or factor interactions.
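The idea of a basis whose parameters are the curve's values at reference points can be illustrated with the simplest case, a polynomial. The sketch below is written in Python purely for illustration and uses a Lagrange basis; it is not the B-spline algorithm used by **bspline** or **flexcurv**, and the function names and example numbers are hypothetical.

```python
def reference_point_basis(x, ref_points):
    """Lagrange basis at x: element j is 1 at ref_points[j] and 0 at the others."""
    basis = []
    for j, rj in enumerate(ref_points):
        value = 1.0
        for k, rk in enumerate(ref_points):
            if k != j:
                value *= (x - rk) / (rj - rk)
        basis.append(value)
    return basis


def curve_value(x, ref_points, values_at_refs):
    """Evaluate the polynomial whose parameters are its values at the reference points."""
    basis = reference_point_basis(x, ref_points)
    return sum(b * v for b, v in zip(basis, values_at_refs))


def to_base_and_differences(values_at_refs, base_index=0):
    """Reparameterize as (value at base point, differences from the base value)."""
    base = values_at_refs[base_index]
    diffs = [v - base for i, v in enumerate(values_at_refs) if i != base_index]
    return base, diffs


# Hypothetical reference points and fitted values at those points.
refs = [0.0, 5.0, 10.0]
params = [2.0, 3.5, 9.0]
print(curve_value(5.0, refs, params))
print(to_base_and_differences(params))
```

Because the basis functions sum to one everywhere, swapping one basis column for a column of ones gives the base-value-plus-differences parameterization described in the abstract.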

**Additional information**
UK11_newson.pdf
UK11_newson_dofiles1.zip
### Experiences and lessons learned from bootstrapping random-effects predictions

Robert Grant

Kingston University and St. George’s University of London

Background: Random effects are commonly modeled in multilevel, longitudinal, and latent-variable settings. Rather than estimating fixed effects for specific clusters of data, “predictions” can be made as the mode or mean of posterior distributions that arise as the product of the random effect (an empirical Bayes prior) and the likelihood function conditional on cluster membership.

Analyses and data: This presentation will explore the experiences and lessons learned in using the bootstrap for inference on random-effects predictors following logistic regression models conducted through both **xtmelogit** and **gllamm**. In the United Kingdom, 203 hospitals were compared on the quality of care received by 10,617 stroke patients through multilevel logistic regression models.

Results and considerations: Multilevel modeling and prediction are both computer-intensive, and so bootstrapping them is especially time-consuming. Examples from do-files with some helpful approaches will be shown. A small proportion of modal best linear unbiased predictors contained errors, possibly arising from the prediction algorithm. Various bootstrap confidence intervals exhibited problems such as excluding the point prediction and degeneracy. Methods for tracing the source will be presented.

Conclusion: Bootstrapping provides flexible but time-consuming inference for individual clusters’ predictions. However, there are potential problems that analysts should be aware of.

**Additional information**
UK11_Grant.ppt
### Sensitivity analysis for randomized trials with missing outcome data

Ian White

MRC Biostatistics Unit, Cambridge

Any analysis with incomplete data makes untestable assumptions about the missing data, and analysts are therefore urged to conduct sensitivity analyses. Ideally, a model is constructed containing a nonidentifiable parameter *d*, where *d* = 0 corresponds to the assumption made in the standard analysis, and the value of *d* is then varied in a range considered plausible in the substantive context. I have produced Stata software for performing such sensitivity analyses in randomized trials with a single outcome, when the user specifies a value or range of values of *d*. The analysis model is assumed to be a generalized linear model with adjustment for baseline covariates. I will describe the statistical model used to allow for the missing data, sketch the programming required to obtain a sandwich variance estimator, and describe modifications needed to make the results given when *d* = 0 correspond exactly to those results available by standard methods. I will illustrate the use of the software for binary and continuous outcomes, when the standard analysis assumes either missing at random or (for a binary outcome) “missing = failure”.
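As a rough illustration of varying a sensitivity parameter, the following Python sketch treats *d* as an assumed difference between the mean of the missing outcomes and the mean of the observed outcomes in one trial arm. This is a simplified pattern-mixture reading chosen for illustration, with no covariate adjustment or variance estimation; it is not the model or the Stata software described in the abstract, and the data are made up.

```python
def arm_mean_under_d(observed, n_missing, d):
    """Arm mean when missing outcomes are assumed to average (observed mean + d)."""
    n_obs = len(observed)
    obs_mean = sum(observed) / n_obs
    missing_mean = obs_mean + d  # assumption: d shifts the unobserved mean
    return (n_obs * obs_mean + n_missing * missing_mean) / (n_obs + n_missing)


# Hypothetical arm: three observed outcomes and one missing outcome.
observed = [10.0, 12.0, 14.0]
for d in (-4.0, -2.0, 0.0, 2.0):
    print(d, arm_mean_under_d(observed, n_missing=1, d=d))
```

With *d* = 0 the sketch reproduces the observed mean, mirroring the requirement that *d* = 0 should return the standard analysis.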

**Additional information**
UK11_White.pdf
### Implementing the continual reassessment method (CRM)

Adrian Mander

MRC Biostatistics Unit Hub for Trials Methodology, Cambridge

One of the aims of a phase I trial in oncology is to find the maximum tolerated dose. A set of doses is administered to participants starting from the lowest dose in increasing steps. To do this safely, the toxicity of each dose is assessed, and a decision is made about whether to proceed with the next highest dose until the desired target toxicity level is found. A suitable dose is then chosen to take forward into phase II studies to discover whether this drug is efficacious. The majority of oncology phase I trials use algorithm-based rules such as the 3 + 3 design to escalate doses; the 3 + 3 design is easy to implement by nonstatisticians but is statistically inefficient. Other designs, such as the continual reassessment method (O’Quigley, Pepe, and Fisher 1990), use a model to help guide the decision of which dose to give. The complexity of the CRM and its need for dedicated software may be reasons why it is not more widely used. This talk will describe a new command **crm** that is a Mata implementation of the CRM and includes some discussion about the programming difficulties.
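To make the model-based idea concrete, here is a minimal Python sketch of one common CRM variant: a one-parameter power model with a normal prior, evaluated on a grid. The skeleton, target, prior standard deviation, and function name are all assumptions for illustration; the sketch omits safety rules such as forbidding dose skipping and makes no claim to match the **crm** command.

```python
import math


def crm_recommend(skeleton, doses_given, toxicities, target=0.25,
                  prior_sd=math.sqrt(1.34)):
    """Return (recommended dose index, posterior mean toxicity per dose).

    Assumed model: P(toxicity at dose i) = skeleton[i] ** exp(a), a ~ N(0, prior_sd^2).
    """
    grid = [-5.0 + 10.0 * i / 2000 for i in range(2001)]  # grid over a
    weights = []
    for a in grid:
        log_w = -0.5 * (a / prior_sd) ** 2  # log prior, up to a constant
        for dose, tox in zip(doses_given, toxicities):
            p = skeleton[dose] ** math.exp(a)
            log_w += math.log(p) if tox else math.log(1.0 - p)
        weights.append(math.exp(log_w))
    total = sum(weights)
    post_tox = [
        sum(w * skeleton[d] ** math.exp(a) for w, a in zip(weights, grid)) / total
        for d in range(len(skeleton))
    ]
    # Recommend the dose whose posterior mean toxicity is closest to the target.
    best = min(range(len(skeleton)), key=lambda d: abs(post_tox[d] - target))
    return best, post_tox


# Hypothetical prior guesses of toxicity at five doses.
skeleton = [0.05, 0.12, 0.25, 0.40, 0.55]
# Three patients treated at the lowest dose (index 0), one toxicity observed.
print(crm_recommend(skeleton, doses_given=[0, 0, 0], toxicities=[0, 0, 1]))
```

After each cohort, the observed outcomes are appended and the function is rerun, so every dose decision uses all data accumulated so far rather than only the most recent cohort.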

**Additional information**
UK11_Mander.pdf
### A review of estimators for the fixed-effects ordered logit model

Arne Risa Hole

University of Sheffield

Joint with Andy Dickerson and Luke Munford

It is well known that the dummy variable estimator for the fixed-effects ordered logit model is inconsistent when *T*, the dimension of the panel, is fixed. This talk will review a range of alternative fixed-effects ordered logit estimators that are based on Chamberlain’s fixed-effects estimator for the binary logit model. The talk will present Stata code for the estimators and discuss the available evidence on their finite-sample performance. We will conclude by presenting an empirical example in which the estimators are used to model the relationship between commuting and life satisfaction.

**Additional information**
UK11_Hole.pdf
### Generalized method of moments fitting of structural mean models

Tom Palmer

MRC CAiTE Centre, School of Social and Community Medicine, University of Bristol

Joint with Roger Harbord, Paul Clarke, and Frank Windmeijer

In this talk we describe how to fit structural mean models (SMMs), as proposed by Robins, using instrumental variables in the generalized method of moments (GMM) framework using Stata’s **gmm** command. The GMM approach is flexible because it can fit overidentified models in which there are more instruments than endogenous variables. It also allows assessment of the joint validity of the instruments using Hansen’s *J* test through Stata’s **estat overid gmm** postestimation command. In the case of the logistic SMM, the approach also allows different first-stage association models. We show the relationship between the multiplicative SMM and the multiplicative GMM estimator implemented in the **ivpois** command of Nichols (2007). For the multiplicative SMM, we show, analogously to Imbens and Angrist (1994) for the linear case, that the estimate is a weighted average of local estimates using the instruments separately. To demonstrate the models, we use a Mendelian randomization example, in which genotypes found to be robustly associated with risk factors from genome-wide association studies are used as instrumental variables, thereby investigating the effect of being overweight on the risk of hypertension in the Copenhagen General Population Study.

**Additional information**
UK11_palmer_handouts.pdf
UK11_palmer_presentation.pdf
### Flexible joint modeling of longitudinal and time-to-event data

Michael J. Crowther

Department of Health Sciences, University of Leicester

Joint with Keith R. Abrams and Paul C. Lambert

The joint modeling of longitudinal and time-to-event data has exploded in the methodological literature in the past decade; however, the availability of software to implement the methods lags behind. The most common form of joint model assumes that the association between the survival and longitudinal processes is underpinned by shared random effects. As a result, computationally intensive numerical integration techniques such as Gauss–Hermite quadrature are required to evaluate the likelihood. We describe a new user-written command **jm**, which allows the user to jointly model a continuous longitudinal response and an event of interest. We assume a linear mixed-effects model for the longitudinal submodel, thereby allowing flexibility through the use of fixed and/or random fractional polynomials of time. We also assume a flexible parametric model (**stpm2**) for the survival submodel. Flexible parametric models are fitted on the log cumulative hazard scale, which has direct computational benefits because it avoids the use of numerical integration to evaluate the cumulative hazard. We describe the features of **jm** through application to a dataset investigating the effect of serum albumin level on time to death from any cause in 252 patients suffering end-stage renal disease.

**Additional information**
UK11_crowther.pdf
### Sample size and power estimation when covariates are measured with error

Michael Wallace

London School of Hygiene and Tropical Medicine

Measurement error in exposure variables can lead to bias in effect estimates, and methods that aim to correct this bias often come at the price of greater standard errors (and so, lower statistical power). This means that standard sample size calculations are inadequate and that, in general, simulation studies are required. Our routine **autopower** aims to take the legwork out of this simulation process, restricting attention to univariate logistic regression where exposures are subject to classical measurement error. It can be used to estimate the power of a particular model setup or to search for a suitable sample size for a desired power. The measurement error correction methods that are employed are regression calibration (**rcal**) and a conditional score method, a Stata routine that we also introduce.

**Additional information**
UK11_wallace.ppt
### Splines models for prediction of house prices

David Boniface

Epidemiology and Public Health, University College London

Aim: To create a web-based facility for customers to enter the address of a house and obtain, within milliseconds, a graph showing the price trend of the house since it was last sold, extrapolated to the current date.

Method: The UK Land Registry of house sale prices was used to estimate mean price trends from 2000 to 2010 for each category of house. The Stata ado-file **uvrs** (with user-specified knots) was used to model the curve. The parameter estimates were saved. Later, to respond in real time to a query about a particular house, **splinegen** was used to generate the spline curve for the appropriate time period, which was adjusted to apply to the particular house and plotted on the webpage.

Challenges: use of coded date, choice of user knots for splines, saving and retrieving the knots and parameter estimates, use of log scale for prices to deal with skewed price distribution, estimation of prediction intervals, and the 2009 slump in house prices.

**Additional information**
UK11_boniface.ppt
### Endogenous treatment effects for count data models with endogenous participation or sample selection

Alfonso Miranda

Institute of Education, University of London

Joint with Massimiliano Bratti

We propose an estimator for models in which an endogenous dichotomous treatment affects a count outcome in the presence of either sample selection or endogenous participation using maximum simulated likelihood. We allow the treatment to affect both the participation or sample selection rule and the main outcome. Applications of this model are frequent in, but not limited to, health economics. We show an application of the model using data from Kenkel (Kenkel and Terza, 2001, *Journal of Applied Econometrics* 16: 165–184), who investigated the effect of physician advice on the amount of alcohol consumption. Our estimates suggest that in these data, a) neglecting treatment endogeneity leads to a wrongly signed effect of physician advice on drinking intensity, b) accounting for treatment endogeneity but neglecting endogenous participation leads to an upwardly biased estimate of the treatment effect, and c) advice affects only the drinking-intensive margin but not drinking prevalence.

**Additional information**
UK11_Miranda.pdf
### Multiple imputation with large proportions of missing data: How much is too much?

Jin Hyuk Lee

Texas A&M Health Science Center

Joint with John Huber Jr.

Multiple imputation (MI) is known as an effective method for handling missing data. However, it is not clear that the method will be effective when the data contain a high percentage of missing observations on a variable. This study examines the effectiveness of MI in data with 10% to 80% missing observations using absolute bias and root mean squared error of MI measured under missing completely at random, missing at random, and not missing at random assumptions. Using both simulated data drawn from a multivariate normal distribution and example data from the Predictive Study of Coronary Heart Disease, the bias and root mean squared error using MI are much smaller than those obtained when complete case analysis is used. In addition, the bias of MI is consistent regardless of increasing imputation numbers (M) from M = 10 to M = 50. Moreover, compared to the regression method and predictive mean matching method, the Markov chain Monte Carlo method can also be used for continuous and univariate missing variables as an imputation mechanism. In conclusion, MI produces less-biased estimates, but when large proportions of data are missing, other things need to be considered such as the number of imputations, imputation mechanisms, and missing data mechanisms for proper imputation.
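For readers unfamiliar with how results from M imputed datasets are combined, the standard pooling step (Rubin's rules) can be sketched in a few lines of Python. The abstract does not describe the pooling code used in the study; the function name and the example numbers below are hypothetical.

```python
def pool_rubin(estimates, variances):
    """Pool M point estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                        # pooled point estimate
    within = sum(variances) / m                       # average within-imputation variance
    between = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between            # total variance
    return q_bar, total


# Hypothetical estimates and variances from M = 3 imputed datasets.
print(pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The factor (1 + 1/M) on the between-imputation variance shrinks toward 1 as M grows, which is one reason the number of imputations matters more when a large share of the data is missing.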

**Additional information**
UK11_lee.pptx
### Testing the performance of the two fold FCS algorithm for multiple imputation of longitudinal clinical records

Irene Petersen

University College London

Joint with Catherine Welch, Jonathan Bartlett, Ian White, Richard Morris, Louise Marston, Kate Walters, Irwin Nazareth, and James Carpenter

Multiple imputation is increasingly regarded as the standard method to account for partially observed data, but most methods have been based on cross-sectional imputation algorithms. Recently, a new multiple-imputation method, the two fold fully conditional specification (FCS) method, was developed to impute missing data in longitudinal datasets with nonmonotone missing data. (See Nevalainen J., Kenward M.G., and Virtanen S.M. 2009. Missing values in longitudinal dietary data: A multiple imputation approach based on a fully conditional specification. *Statistics in Medicine* 28: 3657–3669.) This method imputes missing data at a given time point based on measurements recorded at the previous and next time points. Up to now, the method has only been tested on a relatively small dataset and under very specific conditions. We have implemented the two fold FCS algorithm in Stata, and in this study we further challenge and evaluate the performance of the algorithm under different scenarios. In simulation studies, we generated 1,000 datasets, which were similar in structure to the longitudinal clinical records (The Health Improvement Network primary care database) to which we will apply the two fold FCS algorithm. Initially, these generated datasets included complete records. We then introduced different levels and patterns of partially observed data and applied the algorithm to generate multiply imputed datasets. The results of our initial multiple imputations demonstrated that the algorithm provided acceptable results when using a linear substantive model and data were imputed over a limited time period for continuous variables such as weight and blood pressure. Introducing an exponential substantive model introduced some bias, but estimates were still within acceptable ranges. We will present results for simulation studies that include situations where categorical and continuous variables change over a 10-year period (for example, smokers become ex-smokers, weight increases or decreases) and large proportions of data are unobserved. We also explore how the algorithm deals with interactions and whether initiating the algorithm to run forward or backward in time has any impact on the final data distribution.

**Additional information**
UK11_welch.pptx
### Implementing procedures for spatial panel econometrics in Stata

Gordon Hughes

School of Economics, University of Edinburgh

Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behavior, but it may also reflect concern about the role of omitted variables that are or may be spatially correlated. The classic models of spatial autocorrelation or spatial error rely upon a predefined matrix of spatial weights *W*, which may be derived from an explicit model of spatial interactions but which, alternatively, could be viewed as a flexible approximation to an unknown set of spatial links similar to the use of a translog cost function. With spatial panel data, it is possible, in principle, to regard *W* as potentially estimable, though the number of time periods would have to be large relative to the number of spatial panel units unless severe restrictions are placed upon the structure of the spatial interactions. While the estimation of *W* may be infeasible for most real data, there is a strong, formal similarity between spatial panel models and nonspatial panel models in which the variance–covariance matrix of panel errors is not diagonal. One important variant of this type of model is the random-coefficient model in which slope coefficients differ across panel units so that interest focuses on the mean slope coefficient across panel units. In certain applications, such as those using cross-country (macro-)economic data, the assumption that reaction coefficients are identical across panel units is not intuitively plausible. Instead of just sweeping differences in coefficients into a general error term, the random-coefficient model allows the analyst to focus on the common component of responses to changes in the independent variables while retaining the information about the error structure associated with coefficients that are random across panel units but constant over time for each panel unit.

At present, Stata’s spatial procedures include a range of user-written routines that are designed to deal with cross-sectional spatial data. The recent release of a set of programs (including **spmat**, **spivreg**, and **spreg**) written by Drukker, Prucha, and Raciborski provides Stata’s users with the opportunity to fit a wide range of standard spatial econometric models for cross-sectional data. Extending such procedures to deal with panel data is nontrivial, in part because there are important issues about how panels with incomplete data should be treated. The casewise exclusion of missing data is automatic for cross-sectional data, but omitting a whole panel unit because some of the data in the panel are missing will typically lead to a very large reduction in the size of the working dataset. For example, it is very rare for international datasets on macroeconomic or other data to be complete, so that casewise exclusion of missing data will generate datasets that contain many fewer countries or time periods than might otherwise be usable.

The theoretical literature on econometric models for the analysis of spatial panels has flourished in the last decade with notable contributions from LeSage and Pace, Elhorst, and Pfaffermayr, among others. In some cases, authors have made available specific code for the implementation of the techniques that they have developed. However, the programming language of choice for such methods has been MATLAB, which is expensive and has a fairly steep learning curve for nonusers. Many of the procedures assume that there are no missing data and the procedures may not be able to handle large datasets because the model specifications can easily become unmanageable if either *N* (the number of spatial units) or *T* (the number of time periods) becomes large.

The presentation will cover a set of user-written maximum likelihood procedures for fitting models with a variety of spatial structures including the spatial error model, the spatial Durbin model, the spatial autocorrelation model, and certain combinations of these models; the terminology is attributable to LeSage and Pace (2009). A suite of MATLAB programs to fit these models for both random and fixed effects has been compiled by Elhorst (2010) and provides the basis for the implementation in Stata/Mata. Methods of dealing with missing data, including the implementation of an approach proposed by Pfaffermayr (2009), will be discussed. The problem of missing data is most severe when data on the dependent variable are missing in the spatial autocorrelation model because it means that information on spatial interactions may be greatly reduced by the exclusion of countries or other panel units. In such cases, some form of imputation may be essential, so the presentation will consider alternative methods of imputation. It should be noted that **mi** does not support panel data procedures in general, and the relatively high cost of fitting spatial panel models means that it may be difficult to combine **mi** with spatial procedures for practical applications.

A second aspect of spatial panel models that will be covered in the presentation concerns the links between such models and random-coefficient models that can be fit using procedures such as **xtrc** or the user-written procedure **xtmg**. The classic formulation of random-coefficient models assumes that the variance–covariance matrix of panel errors is diagonal but heteroskedastic. This is an implausible assumption for most cross-country datasets, so it is important to consider how it may be relaxed, either by allowing for explicit spatial interactions or by using a consistent estimator of the cross-country variance–covariance matrix.

The user-written procedures introduced in the presentation will be illustrated by applications drawn from analyses of demand for infrastructure, health outcomes, and climate for cross-country data covering the developing and developed world plus regions in China.

**Additional information**
UK11_hughes.pdf
### Structural equation modeling for those who think they don’t care

Vince Wiggins

StataCorp LP

We will discuss SEM (structural equation modeling), not from the perspective of the models for which it is most often used—measurement models, confirmatory factor analysis, and the like—but from the perspective of how it can extend other estimators. From a wide range of choices, we will focus on extensions of mixed models (random and fixed-effects regression). Extensions include conditional effects (not completely random), endogenous covariates, and others.

**Additional information**
UK11_Wiggins.pdf
### Chained equations and more in multiple imputation in Stata 12

Yulia Marchenko

StataCorp LP

I present the new Stata 12 command, **mi impute chained**, to perform multivariate imputation using chained equations (ICE), also known as sequential regression imputation. ICE is a flexible imputation technique for imputing various types of data. The variable-by-variable specification of ICE allows you to impute variables of different types by choosing the appropriate method for each variable from several univariate imputation methods. Variables can have an arbitrary missing-data pattern. By specifying a separate model for each variable, you can incorporate certain important characteristics, such as ranges and restrictions within a subset, specific to each variable. I also describe other new features in multiple imputation in Stata 12.

**Additional information**
UK11_marchenko.pdf
### Exporting CAPI data to Stata: Experience from surveybe

Joachim De Weerdt

Economic Development Initiatives, Tanzania

Researchers typically spend significant amounts of time cleaning and labeling data files in preparation for analyses of survey data. Computer-assisted personal interviewing (CAPI) gives the ability to automate this process. First, consistency checks can be run during the interview so that only data that passes autogenerated and user-written validation tests comes back from the field. Second, CAPI allows for the autogeneration of a Stata do-file that labels data files. This presentation discusses the Stata export procedure used by **surveybe**, a CAPI application designed to handle complex surveys. The questions, as displayed on the screen to the interviewer, are automatically turned into variable labels. Likewise, the drop-down menus are autocoded as value labels. Furthermore, the export procedure ensures that data from rosters get exported into different Stata data files and that complete referential integrity is ensured across all the files originating from the same survey, with unique primary keys linking files together. Any changes made to the electronic questionnaire (for example, adding a response code to the drop-down menu) or changes to the phrasing of a question will be automatically incorporated into the exported data files, thus ensuring that the data files match the questionnaires completely.
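The general idea of autogenerating a labeling do-file can be sketched as follows. This Python example is illustrative only: the questionnaire structure, variable names, and `_lbl` naming convention are hypothetical and do not describe how **surveybe** works internally. The emitted lines use Stata's standard `label variable`, `label define`, and `label values` commands.

```python
def make_label_dofile(questions):
    """Build do-file text that labels variables and attaches value labels.

    Assumes question text and code labels contain no double quotes.
    """
    lines = []
    for q in questions:
        # Stata variable labels hold at most 80 characters.
        lines.append(f'label variable {q["name"]} "{q["text"][:80]}"')
        codes = q.get("codes")
        if codes:
            label_name = q["name"] + "_lbl"  # hypothetical naming convention
            pairs = " ".join(f'{code} "{text}"' for code, text in sorted(codes.items()))
            lines.append(f"label define {label_name} {pairs}")
            lines.append(f'label values {q["name"]} {label_name}')
    return "\n".join(lines) + "\n"


# Hypothetical two-question questionnaire.
questionnaire = [
    {"name": "hh_size", "text": "How many people live in this household?"},
    {"name": "own_land", "text": "Does the household own land?",
     "codes": {1: "Yes", 2: "No"}},
]
print(make_label_dofile(questionnaire))
```

Regenerating the do-file from the questionnaire definition each time is what keeps labels in step with later edits to question wording or response codes.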

**Additional information**
UK11_deweerdt.pdf
### Using Mata to import Illumina SNP chip data for genome-wide association studies

J. Charles Huber Jr.

Texas A&M University

Joint with Michael Hallman, Victoria Friedel, Melissa Richard and Huandong Sun

Modern genetic genome-wide association studies typically rely on single nucleotide polymorphism (SNP) chip technology to determine hundreds of thousands of genotypes for an individual sample. Once these genotypes are ascertained, each SNP alone or in combination is tested for association with outcomes of interest such as disease status or severity. Project Heartbeat! was a longitudinal study conducted in the 1990s that explored changes in lipids and hormones and morphological changes in children from 8 to 18 years of age. A genome-wide association study is currently being conducted to look for SNPs that are associated with these developmental changes. While there are specialty programs available for the analysis of hundreds of thousands of SNPs, they are not capable of modeling longitudinal data. Stata is well equipped for modeling longitudinal data but cannot load hundreds of thousands of variables into memory simultaneously. This talk will briefly describe the use of Mata to import hundreds of thousands of SNPs from the Illumina SNP chip platform and how to load those data into Stata for longitudinal modeling.

**Additional information**
UK11_Huber.pptx
### Using Stata for handling CDISC datasets

Adam Jacobs

Dianthus Medical Limited

The Clinical Data Interchange Standards Consortium (CDISC) is a globally relevant nonprofit organization that defines standards for handling data in clinical research. It produces a range of standards for clinical data at various stages of maturity. One of the most mature standards is the Study Data Tabulation Model, which provides a standardized yet flexible data structure for storing entire databases from clinical trials. A related standard is the Analysis Dataset Model, which defines datasets that can be used for analyzing data from clinical trials. I shall explain how the CDISC standards work, how Stata can simplify many of the routine tasks encountered in handling CDISC datasets, and the great efficiencies that can result from using datasets in a standardized structure.

**Additional information**
UK11_jacobs.ppt
### Picturing mobility: Transition probability color plots

Philippe Van Kerm

CEPS/INSTEAD, Luxembourg

This talk presents a simple but effective graphical device for visualization of patterns of income mobility. The device in effect uses color palettes to picture information contained in transition matrices created from a fine partition of the marginal distributions. The talk explains how these graphs can be constructed using the user-written package **spmap** from Maurizio Pisati, briefly presents the wrapper command **tpcplot** (for transition probability color plots), and demonstrates how such graphs are effective for contrasting patterns of mobility in different countries or contrasting observed patterns against benchmarks of maximal or minimal mobility.
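The transition matrix underlying such a plot can be computed along the following lines. This Python sketch is illustrative: the rank-based grouping rule, the function names, and the toy incomes are assumptions, and it does not reproduce **tpcplot** or its plotting step.

```python
def quantile_groups(values, n_groups):
    """Assign each value to a group 0..n_groups-1 by rank (ties broken by position)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(values)
    return groups


def transition_matrix(income_t0, income_t1, n_groups):
    """Row-normalized matrix: entry [a][b] is P(group b at t1 | group a at t0)."""
    g0 = quantile_groups(income_t0, n_groups)
    g1 = quantile_groups(income_t1, n_groups)
    counts = [[0] * n_groups for _ in range(n_groups)]
    for a, b in zip(g0, g1):
        counts[a][b] += 1
    return [[c / sum(row) if sum(row) else 0.0 for c in row] for row in counts]


# Hypothetical incomes for four people in two periods.
no_mobility = transition_matrix([10, 20, 30, 40], [10, 20, 30, 40], 2)
full_reversal = transition_matrix([10, 20, 30, 40], [40, 30, 20, 10], 2)
print(no_mobility)
print(full_reversal)
```

The two toy cases correspond to benchmark patterns: an identity matrix when nobody changes rank and an anti-diagonal matrix when ranks are fully reversed. A finer partition (larger `n_groups`) gives the cell-level detail that a color plot is designed to show.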

**Additional information**
UK11_vankerm.pdf
### Running multilevel models in MLwiN from within Stata: runmlwin

George Leckie

Centre for Multilevel Modelling, University of Bristol

Joint with Chris Charlton

Multilevel analysis is the statistical modeling of hierarchical and nonhierarchical clustered data. These data structures are common in social and medical sciences. Stata provides the **xtmixed**, **xtmelogit**, and **xtmepoisson** commands for fitting multilevel models, but these are only relevant for univariate continuous, binary, and count response variables, respectively. A much wider range of multilevel models can be fit using the user-written **gllamm** command, but **gllamm** can be computationally slow for large datasets or when there are many random effects. Many Stata users therefore turn to specialist multilevel modeling packages such as MLwiN for fast fitting of a wide range of complex multilevel models. MLwiN includes the following features: fitting of multilevel models for *n*-level hierarchical and nonhierarchical data structures; fast fitting via classical and Bayesian methods; fitting of multilevel models for continuous, binary, ordered categorical, unordered categorical, and count data; fitting of multilevel multivariate response models, spatial models, measurement error models, multiple-imputation models, and multilevel factor models; interactive model equation windows and graph windows for model exploration; and availability that is free to academics in the United Kingdom. In this presentation, we will introduce the **runmlwin** command to fit multilevel models in MLwiN from within Stata and to return estimation results to the Stata environment. We shall demonstrate **runmlwin** in action with several example multilevel analyses in which we fit models and use Stata’s standard postestimation commands such as **predict** and **test** to calculate predictions, perform hypothesis tests, and produce publication-quality graphics.

**Additional information**
UK11_leckie.do
UK11_leckie.pdf
### Plagiarism in student papers and cheating in student exams: Results from surveys using special techniques for sensitive questions

Ben Jann

University of Bern

Eliciting truthful answers to sensitive questions is an age-old problem in survey research. Respondents tend to underreport socially undesired or illegal behaviors while overreporting socially desirable ones. To combat such response bias, various techniques have been developed that are geared toward providing the respondent greater anonymity and minimizing the respondent’s feelings of jeopardy. Examples of such techniques are the randomized response technique, the item-count technique, and the crosswise model. I will present results from several surveys, conducted among university students, that employ such techniques to measure the prevalence of plagiarism and cheating in exams. User-written Stata programs for analyzing data from such techniques are also presented.
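
To make the idea behind the randomized response technique concrete, the Python sketch below estimates prevalence under Warner's original design, which is only one of several variants and not necessarily the one used in these surveys. In that design, each respondent is privately assigned the sensitive statement with probability `p_sensitive` and its negation otherwise, and answers "yes" or "no" truthfully. The function name `warner_estimate` and the example figures are hypothetical, and this is not one of the user-written Stata programs mentioned above.

```python
import math


def warner_estimate(n_yes, n_total, p_sensitive):
    """Estimate prevalence under Warner's randomized response design (assumed).

    P(yes) = p * pi + (1 - p) * (1 - pi), so solving for pi gives
    pi = (P(yes) - (1 - p)) / (2p - 1).
    Returns the point estimate and a plug-in standard error.
    """
    if n_total <= 0 or not 0 <= n_yes <= n_total:
        raise ValueError("need 0 <= n_yes <= n_total and n_total > 0")
    if not 0.0 < p_sensitive < 1.0 or p_sensitive == 0.5:
        raise ValueError("p_sensitive must lie in (0, 1) and differ from 0.5")

    share_yes = n_yes / n_total
    scale = 2.0 * p_sensitive - 1.0

    # The estimate can fall outside [0, 1] in small samples; it is not truncated here.
    prevalence = (share_yes - (1.0 - p_sensitive)) / scale
    std_error = math.sqrt(share_yes * (1.0 - share_yes) / n_total) / abs(scale)
    return prevalence, std_error


# Hypothetical figures: 380 "yes" answers from 1,000 respondents, p = 0.7.
estimate, se = warner_estimate(380, 1000, 0.7)
```

The price of the added anonymity is visible in the standard error: dividing by `|2p - 1|` inflates it relative to a direct question, and more so as `p_sensitive` approaches 0.5.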

**Additional information**
UK11_jann.pdf
### Lowering your handicap with Stata

Tim Collier

London School of Hygiene and Tropical Medicine

When I first met Stata in October 2000, my golf handicap was 27 and my game was going nowhere slowly. Ten years of intensive Stata therapy later, my handicap is 17.3 and falling. It would, of course, be nonsense to infer from these data that lowering your handicap increases Stata use, but could the reverse be true? Could there be a causal relationship between increasing Stata use and a decreasing handicap? In this presentation, I argue that, yes, there is. Granted, Stata might not work along the lines of traditional golf training aids, but rather its effect is mediated through a third factor, namely time. Golf consumes time. Stata produces time. In this presentation, I will demonstrate how minutes in Stata’s programming world are equivalent to hours in the real world, and by the use of programs within programs, minutes can translate to days. Although extrapolation from an *N* of 1 is nearly always dangerous, I believe that Stata could be similarly used to reduce your weight, improve foreign language skills, or even increase research output.

**Additional information**
UK11_Collier.ppt
### Fun and fluency with functions

Nicholas J. Cox

Durham University

Functions in Stata range between those you know you want and those you don’t know you need. The word “functions” is heavily overloaded in Stata; here the focus is on functions in the strict sense, _variables, extended macro functions, and **egen** functions. Often Stata users in difficulty are seeking commands or imagining that they need to write programs, when a few lines of code using functions would crack their problem. In this talk, I will briefly give some general advice on using functions and in more detail discuss a variety of examples, with the aim of introducing something unappreciated but useful to almost everyone. Somehow or other, graphs and my own work will also be mentioned.

**Additional information**
UK11_Cox_functions.html
UK11_Cox_functions.smcl
### Panel time-series modeling: New tools for analyzing xt data

Markus Eberhardt

University of Oxford

Stata already has an extensive range of built-in and user-written commands for analyzing **xt** (cross-sectional time-series) data. However, most of these commands do not take into account important features of the data relating to their time-series properties or cross-sectional dependence. This talk reviews the recent literature concerned with these features with reference to the types of data in which they arise. Most of the talk will be spent discussing and illustrating various Stata commands for analyzing these types of data, including several new user-written commands. The talk should be of general interest to users of **xt** data and of particular interest to researchers with panel datasets in which countries or regions are the unit of analysis and there is also a substantial time-series element. Over the past two decades, a literature dedicated to the analysis of macro panel data has concerned itself with some of the idiosyncrasies of this type of data, including variable nonstationarity and cointegration, as well as with the investigation of possible parameter heterogeneity across panel members and its implications for estimation and inference. Most recently, this literature has turned its attention to concerns over cross-sectional dependence, which can arise either in the form of unobservable global shocks that differ in their impact across countries (for example, the recent financial crisis) or as spillover effects (again, unobservable) between a subset of countries or regions.

**Additional information**
UK11_eberhardt.pdf