StataCorp, College Station, TX
Stata 14 provides a suite of commands for performing Bayesian analysis. Bayesian analysis is a statistical paradigm that answers research questions about unknown parameters using probability statements. For example, what is the probability that a person accused of a crime is guilty? What is the probability that there is a positive effect of schooling on wage? What is the probability that the odds ratio is between 0.3 and 0.5? And many more. In my presentation, I will describe Stata's Bayesian suite of commands and demonstrate its use in various applications.
Department of History and Sociology, University of Konstanz, Germany
Log-linear models for cross-tabulations describe and test patterns in cross-tabulations. These cross-tabulations could have two dimensions (e.g. father’s occupation versus son’s occupation) or more than two dimensions (e.g. father’s occupation versus son’s occupation for different cohorts and different countries). A wide range of patterns can be investigated and tested with these models. For example, one can investigate whether the dimensions are independent (e.g. father’s occupation has no relevance for the son’s occupation) or whether the dimensions are independent except for the diagonals (e.g. sons are more likely to enter the occupation of their father, but the father has no influence once the son chooses a different occupation). One can also assume that the categories are ordinal, estimate a scale for each dimension, and summarize the strength of the association with one number, which can be compared across cohorts or countries. The purpose of this talk is to give an overview of this family of models, to discuss how to trick Stata (in particular, poisson and gsem) into estimating these models, and to show how to get interpretable parameters out of these models.
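As a minimal sketch of the simplest pattern named in this abstract (independence), the Python code below computes the expected counts that the independence log-linear model implies for a two-way table, together with the likelihood-ratio statistic G². It is not the Stata poisson/gsem approach the talk covers; the function name and the table of counts are hypothetical and chosen only for illustration.

```python
import math

def independence_fit(table):
    """Expected counts, G^2 and degrees of freedom under independence."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Under independence, expected count = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    # Likelihood-ratio statistic; empty cells contribute zero
    g2 = 2 * sum(
        obs * math.log(obs / exp)
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
        if obs > 0
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return expected, g2, df

# Hypothetical counts: father's occupation (rows) by son's occupation (columns)
mobility = [[50, 20, 10],
            [20, 40, 20],
            [10, 20, 50]]
expected, g2, df = independence_fit(mobility)
print(round(g2, 2), df)
```

A large G² relative to its degrees of freedom suggests that independence does not describe the table, which is the point at which the richer patterns described in the abstract (for example, separate diagonal effects) become interesting.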
Institute for Analytical Sociology, Linköping University, Sweden
The field of social network analysis is one of the most rapidly growing fields of the social sciences. Social network analysis focuses on the relationships that exist between individuals (or other units of analysis) such as friendship, advice, trust, or trade relationships. Network analysis is concerned with the visualization and analysis of network structures, as well as with the importance of networks for individuals’ propensities to adopt different kinds of behaviors. Up until now such analyses have only been possible to perform using specialized software for network analysis. This tutorial introduces the so-called nwcommands, a software suite with over 80 Stata commands for social network analysis. The software includes commands (and dialog boxes) for importing, exporting, loading, saving, handling, manipulating, replacing, generating, visualizing, and animating networks. It also includes commands for measuring various properties of the networks and the individual nodes, for detecting network patterns and measuring the similarity of different networks, as well as advanced statistical techniques for network analysis including MR-QAP and ERGM.
Christopher F. Baum
Boston College, USA and DIW Berlin, Germany
Stata 13 added a very important feature for macroeconomists: the forecast suite of commands that implements the definition of a model, consisting of a number of estimated equations and potentially nonlinear identities. Stata’s features include model solution, dynamic forecasting, scenario analysis and stochastic simulation. I report on my attempt to apply the forecast suite to a well-known large-scale macroeconomic model. I discuss the challenges related to use of these features in a much more complex context than that illustrated in the manual’s examples. I will also suggest a number of enhancements that would improve forecast’s capabilities in comparison to other popular forecasting tools.
Department of History and Sociology, University of Konstanz, Germany
There is increasing criticism of the ways in which the raw coefficients and odds ratios from logistic regression have been used. The argument is that logistic regression models a latent propensity of success and that the scale of that latent variable is fixed by fixing the variance of the error term. If one adds a variable to a model, the variance of the residual is likely to decrease, and the scale of the dependent variable thus changes. Comparing models with and without that additional variable thus becomes problematic. Similarly, a comparison of models in groups that are likely to have different residual variances will also be problematic. However, I will argue that logistic regression has an unusual dependent variable: a probability, which measures how certain we are that an event of interest happens. This degree of certainty is a function of how much information we have, which in the case of logistic regression is captured by the variables we add to the model. If the dependent variable is interpreted in that way, many of the problems with logistic regression turn out to be desirable properties of the logistic regression model.
Spatial Economics & Econometrics Centre (SEEC), Heriot-Watt University, UK
Robert L. Hicks
College of William and Mary, Williamsburg VA, USA
Kurt E. Schnier
University of California Merced CA, USA
Agents may consider information and other signals from their peers (especially close peers) when making their spatial site choices. However, the presence of other agents in a spatial location may generate congestion or agglomeration effects. Disentangling potential peer effects from congestion is difficult, since it is hard to ascertain whether the observed congestion effects result from observing others’ behavior or from peer effects within the same network encouraging a fisherman to visit a site even in the presence of congestion. The research develops an empirical framework to decompose both motivations in a spatial discrete choice model in an effort to synthesize the congestion/agglomeration literature with the peer effects literature. Using Monte Carlo analysis, we investigate the robustness of our proposed estimation routine relative to the conventional random utility model (RUM), which ignores both peer and congestion/agglomeration effects, and to the spatial sorting equilibrium model, which ignores peer effects. Our results indicate that both the RUM and sorting equilibrium models can be used to successfully investigate the presence of peer effects. However, the estimates of congestion effects are poor because of ignored correlated random effects. Recent literature has largely used Bayesian methods for this hard problem. We also explore the use of Fixed Effects Multinomial Logit estimates to first estimate the base model, and then extract generalized residuals to estimate the peer effects.
Health Behaviour Research Centre, UCL
This paper illustrates the use of a recently developed Stata procedure ipdpower (Kontopantelis, E.) in designing a cluster randomised trial. The trial was required to compare pre-post change between intervention and non-intervention care homes. Forty-nine residential care homes ranging in size from 3 to 112 beds (median 27 beds) were available to take part. Primary outcome measures were tooth cleaning (a dichotomy) and the Geriatric Oral Health Assessment Index (GOHAI, a continuous score). As is common in this situation, it was necessary to explore the effect on sample size and power of a range of values of cluster sizes, within cluster correlation, between group variation, and intraclass correlation. Ranges of parameter values for a number of runs of the simulation procedure were obtained from published results of studies with similar features, transformed where necessary through standard formulae. The final design resulted in a recommendation to use 16 homes, with estimated statistical power of 80% for comparison of intervention with non-intervention participants, adjusting for baseline values. Simulation can be recommended as a valuable approach since it takes account of all features of the design, it facilitates communication among members of the study team in balancing design features, and it provides a clear sense of the size required for the necessary statistical power.
Nicholas J. Cox
Durham University, UK
Time series (and similar one-dimensional series) are more often irregularly spaced than many methods texts or courses admit. Even with a plan of regular measurements, gaps can arise for many human or inhuman reasons, while some series are naturally irregular. Interpolation of values between known values is a centuries-old need, but one neglected by official Stata, which offers only linear interpolation and cubic spline interpolation (in Mata). I review additional user-written commands for interpolation, including those for cubic, nearest neighbour and piecewise cubic Hermite methods available from SSC. Beyond interpolation of irregular series lie the questions of characterising the structure of such series and smoothing in various ways. One useful tool standard in spatial statistics is the variogram, which relates dissimilarity as squared differences between values to their separation in time or distance in space. Diggle and others have shown uses for variograms in time series and longitudinal data analysis. I discuss user-written Stata commands for variogram calculation, plotting and use in relation to exploratory data analysis on the one hand and smoothing on the other.
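The variogram described in this abstract can be sketched in a few lines of Python. The code below is a minimal illustration for an irregularly spaced series, not one of the user-written Stata commands the talk discusses: it groups all pairs of observations by their separation in time and reports half the mean squared difference in each group. The function name, the binning rule and the example data are assumptions made for illustration.

```python
from collections import defaultdict

def empirical_variogram(times, values, bin_width):
    """Half the mean squared difference between pairs, grouped by time separation."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    n = len(times)
    for i in range(n):
        for j in range(i + 1, n):
            lag = abs(times[j] - times[i])
            b = int(lag // bin_width)          # index of the separation bin
            sums[b] += (values[j] - values[i]) ** 2
            counts[b] += 1
    # Semivariance per bin, keyed by the lower edge of the bin
    return {b * bin_width: sums[b] / (2 * counts[b]) for b in sorted(counts)}

# Hypothetical irregular series: observation times are not equally spaced
times = [0, 1, 3, 4, 8]
values = [2.0, 2.5, 3.5, 3.0, 6.0]
for lag, gamma in empirical_variogram(times, values, bin_width=2).items():
    print(lag, round(gamma, 3))
```

Because every pair contributes according to its actual separation, the calculation needs no regular spacing, which is why the variogram suits series with gaps.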
Research Institute on Sustainable Economic Growth, Rome
This paper presents rscore, a Stata module to compute unit responsiveness scores using an iterated random coefficient regression (RCR). The basic econometrics of this model can be found in Wooldridge (2002, pp. 638-642). The model estimated by rscore starts from a classical regression of Y, the target variable, on a series of factors X (the regressors), by assuming a different reaction (or responsiveness) of each unit to each factor contained in X. This is done by using a random coefficient regression (RCR), an approach in which the usual regression coefficients vary across units. The application of such an approach can convey new and interesting analytical findings compared to the traditional regression approach. In particular, by measuring a unit-specific regression coefficient for each regressor, this model allows for: (i) ranking units according to the level of the responsiveness score obtained; (ii) detecting factors that are more influential in driving unit performance; (iii) studying, more generally, the distribution (variety) of the factors’ responsiveness scores across units. The knowledge of these idiosyncratic scores can also be exploited to test the presence of increasing, constant, or decreasing returns of Y to X in a straightforward and graphically easy-to-read way.
St George's, University of London, UK and Kingston University, UK
Over the last three years, a new package for Bayesian modelling called Stan (after Stanislaw Ulam, co-inventor of the Monte Carlo method) has been developing quickly and making an impact on computing for complex Bayesian models. By translating the model into C++ and then compiling that, it can run much faster than BUGS. A particular benefit is for simulation studies, because the model only needs to be compiled once. Furthermore, it includes a much faster and better mixing algorithm (NUTS: the No U-Turn Sampler), especially for correlated parameters that Gibbs samplers like BUGS cope with badly. I present a program StataStan, which sends your data and specifications to Stan, displays results, and can read the chains of samples back into Stata. There are also specific commands to run the commonly used models in the BUGS and Stan user manuals with your own data, avoiding the need to write the Stan model.
Michael J. Grayling
MRC Biostatistics Unit Cambridge
The normal distribution holds significant importance in statistics. Much real-world data either is, or is assumed to be, normally distributed. Today though, a considerable amount of the statistical analysis performed is not univariate, but multivariate in nature. Consequently, the multivariate normal distribution is of increasing importance. However, the complexity of this distribution makes computational analysis almost certainly necessary, and thus much research has been conducted into developing efficient algorithms for its numerical analysis. Here we discuss our implementation of a certain choice of algorithm in Mata that allows its distribution function and equi-coordinate quantiles to be identified seamlessly for any choice of location vector and positive semi-definite covariance matrix. Moreover, we detail new commands to efficiently compute its density and to generate pseudo-random variables. We then discuss the performance of our commands relative to the presently available alternatives, and present how they provide greater generalisation and improved computational speed. Finally, through the example of designing a group sequential clinical trial, we demonstrate how our commands can be used easily to solve real-world problems facing Stata users.
University of Bern, Switzerland
Percentile shares provide an intuitive and easy-to-understand way of analyzing income or wealth distributions. A celebrated example is the top income shares reported in the works of Thomas Piketty and colleagues. Moreover, series of percentile shares, defined as differences between Lorenz ordinates, can be used to visualize whole distributions or changes in distributions. In this talk I present a new command called pshare that computes and graphs percentile shares (or changes in percentile shares) from individual-level data. The command also provides confidence intervals and supports survey estimation.
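The definition of percentile shares as differences between Lorenz ordinates can be illustrated with a short Python sketch. This is not the pshare command: it ignores weights, survey design and confidence intervals, and the function names, the linear interpolation rule and the example incomes are assumptions made for illustration.

```python
def lorenz_ordinate(sorted_incomes, p):
    """Share of total income held by the poorest fraction p (linear interpolation)."""
    n = len(sorted_incomes)
    total = sum(sorted_incomes)
    position = p * n                  # number of units covered, possibly fractional
    whole = int(position)
    partial = position - whole
    cum = sum(sorted_incomes[:whole])
    if whole < n:
        cum += partial * sorted_incomes[whole]
    return cum / total

def percentile_shares(incomes, cutoffs):
    """Income share of each group between consecutive cutoffs."""
    ys = sorted(incomes)
    ordinates = [lorenz_ordinate(ys, p) for p in cutoffs]
    # Each share is the difference between two adjacent Lorenz ordinates
    return [hi - lo for lo, hi in zip(ordinates, ordinates[1:])]

# Hypothetical incomes; shares of the bottom 50%, the next 40% and the top 10%
incomes = [12, 15, 18, 20, 22, 25, 30, 38, 50, 120]
print([round(s, 3) for s in percentile_shares(incomes, [0, 0.5, 0.9, 1])])
```

The shares always sum to one when the cutoffs run from 0 to 1, so a series of them describes the whole distribution.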
Deanna Jannat-Khah and Michelle Unterbrink
Weill Cornell Medical College, New York
Co-authors: Margaret McNairy, Samuel Pierre, Dan Fitzgerald, Jean Pape, Arthur Evans
Loss to follow-up is unavoidable in many public health studies. Tracing all subjects may be impractical or prohibitively expensive. Traditional methods, including Kaplan-Meier analysis and inverse probability weighting (IPW), produce biased estimates if loss is not independent of survival. Multiple imputation with chained equations (MICE) provides an acceptable, robust and cost-saving solution to this problem for HIV research in developing countries with limited resources. To illustrate its utility, we applied MICE to ascertain the outcome status of people who were lost to follow-up within a cohort of N = 910 HIV-positive people followed for ten years in Port-au-Prince, Haiti. Of these, 17% (n = 156) were lost to follow-up and 8% (n = 71) transferred facilities. Contact tracing was performed and 45 of the 156 subjects identified as lost to follow-up were found: 37 alive and 8 deceased. Analysis using IPW based on the traced subjects predicted that 63% of all subjects were alive at 10 years (95% CI 0.59-0.67). Results from MICE predicted that within 6 months 12% (95% CI 0.10-0.14) of those who were lost to follow-up or transferred were dead and 88% (95% CI 0.86-0.90) were alive. At 10 years, 33% were predicted to be dead (95% CI 0.29-0.36) and 67% (95% CI 0.64-0.71) were predicted to be alive. We found MICE to be more robust in predicting status as it allowed us to impute missing data so that we had the maximum number of observations to perform regression analyses. Additionally, the results were easier to interpret, less likely to be biased, and provided an interesting insight into a problem that is often commented upon in the extant literature. Overall, MICE is a useful cost-saving method for studying survival compared to contact tracing for HIV research in developing countries.
Behavioral Research Group | Quantitative Risk Management
With more and more data being stored by organizations across industries (from academia to health care to banking), along with plummeting storage and RAM costs, there is a growing need for tools to analyze “big data”. The world is moving from needing to analyze megabytes of data to needing to analyze many gigabytes. While Stata is very user-friendly, many of the most basic commands, such as summarize, sample, collapse, and encode, are not optimized for speed. As of Stata 14, these commands all rely on sorting, making them tens, or even hundreds (in the case of sample), of times slower than what is possible with better algorithms. In this presentation I illustrate alternative algorithms, along with coded examples in Stata, Mata, and C++ plugins, which may be used to analyze big data more quickly. fastsample and fastcollapse are available from the SSC.
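To make the idea of sampling without sorting concrete, here is a hedged Python sketch of one well-known alternative, sequential selection sampling, which draws a simple random sample in a single pass. The abstract does not say which algorithm fastsample uses, so this should be read as an assumed illustration of the general idea rather than a description of that command; the function name is hypothetical.

```python
import random

def selection_sample(n, k, rng=random):
    """Draw k of the indices 0..n-1 without replacement in one pass, no sorting."""
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and n")
    chosen = []
    needed = k
    for i in range(n):
        remaining = n - i
        # Select index i with probability needed / remaining
        if rng.random() * remaining < needed:
            chosen.append(i)
            needed -= 1
            if needed == 0:
                break
    return chosen

# Keep 5 of 1,000,000 observation indices; the seed is fixed only for repeatability
print(selection_sample(1_000_000, 5, random.Random(2015)))
```

The loop does a constant amount of work per observation, so its cost grows linearly with the number of observations, whereas a sort-based approach grows as n log n.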
Tim P. Morris
MRC Clinical Trials Unit, University College London, UK
Co-author: Babak Choodari-Oskooei
Statisticians and econometricians developing new methods are keen for their methods to be adopted, and releasing user-friendly software plays an important role in uptake. Methods that were not initially applied much, and became so after software implementations, include Cox’s proportional-hazards model, multiple imputation and propensity score matching. It is easy to release packages to the Stata community via the Boston College Statistical Software Components (SSC) archive, but gauging the uptake can be difficult. Stata’s ssc hot command lists the number of hits for a recent month for packages available on SSC. The new ssccount command goes further, obtaining monthly files of hits (from July 2007 when records began) for specified authors and packages, and optionally plots the number of hits over time. This can give authors an impression of how well their commands are being used. Funders are increasingly asking for evidence of impact, and thus ssccount provides a useful soft measure.
Roger B. Newson
Department of Primary Care and Public Health, Imperial College London
Somers’ D(Y|X) is an asymmetric measure of ordinal association between two variables Y and X, on a scale from –1 to 1. It is defined as the difference between the conditional probabilities of concordance and discordance between two randomly-sampled (X,Y)-pairs, given that the two X-values are ordered. The somersd package enables the user to estimate Somers’ D for a wide range of sampling schemes, allowing clustering and/or sampling-probability weighting and/or restriction to comparisons within strata. Somers’ D has the useful feature that a larger D(Y|X) cannot be secondary to a smaller D(W|X) with the same sign, enabling us to make scientific statements that the first ordinal association cannot be caused by the second. An important practical example, especially for public-health scientists, is the case where Y is an outcome, X an exposure, and W a propensity score. However, an audience accustomed to other measures of association may be culture-shocked, if we present associations measured using Somers’ D. Fortunately, under some commonly-used models, Somers’ D is related monotonically to an alternative association measure, which may be more clearly related to the practical question of how much good we can do. These relationships are nearly linear (or log-linear) over the range of Somers’ D values from –0.5 to 0.5. We present examples with X and Y binary, with X binary and Y a survival time, with X binary and Y conditionally Normal, and with X and Y bivariate Normal. Somers’ D can therefore be used as a common currency for comparing a wide range of associations between variables, not limited to a particular model.
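The definition of Somers’ D(Y|X) given in this abstract can be written out directly. The Python sketch below is an unweighted, brute-force illustration over all pairs of observations; it is not the somersd package and provides no confidence intervals, clustering, weights or strata. The function name and example data are hypothetical.

```python
from itertools import combinations

def somers_d(x, y):
    """Somers' D(Y|X): (concordant - discordant) / pairs with unequal X."""
    concordant = discordant = ordered = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue                  # X-values not ordered: pair is excluded
        ordered += 1
        sign = (x2 - x1) * (y2 - y1)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # pairs tied on Y stay in the denominator but add to neither count
    if ordered == 0:
        raise ValueError("no pairs with distinct X-values")
    return (concordant - discordant) / ordered

# Hypothetical binary exposure X and binary outcome Y
exposure = [0, 0, 1, 1]
outcome = [0, 1, 1, 1]
print(somers_d(exposure, outcome))
```

With X and Y both binary, as in this example, the result equals the difference between the two outcome proportions (here 1.0 minus 0.5), which is one instance of the link to more familiar association measures that the abstract describes.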
Tra M. Pham
Department of Primary Care & Population Health, University College London
Tim P. Morris
MRC Clinical Trials Unit, University College London, UK
Department of Primary Care and Population Health, University College London, UK
Ethnicity is an important factor to be considered in many epidemiological studies because of its association with inequality in disease prevalence and the utilisation of healthcare. Ethnicity recording has been incorporated in primary care electronic health records, and therefore is available in a number of large UK primary care databases such as The Health Improvement Network (THIN). However, since primary care data are routinely collected to serve clinical purposes, a large amount of data that are relevant for research purposes, including ethnicity, is often missing. A popular approach is to use multiple imputation, but standard multiple imputation does not give plausible estimates of the ethnicity distribution in THIN compared to the general UK population. However, census data can be utilised to form weights to use in multiple imputation such that the correct ethnicity distribution is recovered. I will describe how the method of weighted multiple imputation of missing data is implemented using Stata’s mi impute suite, note some issues, and introduce a new procedure to implement the method for multiple incomplete variables which require different imputation weights. Finally, I will give an example showing how the method works when ethnicity is used as an explanatory variable in a cohort study.
Joao M.C. Santos Silva
University of Essex
Quantile regression is increasingly used by practitioners, but there are still some misconceptions about how difficult it is to obtain valid standard errors in this context. In this presentation I discuss the estimation of the covariance matrix of the quantile regression estimator, focusing special attention on the case where the regression errors may be heteroskedastic and/or “clustered”. Specification tests to detect heteroskedasticity and intra-cluster correlation are discussed, and small simulation studies illustrate the finite sample performance of the tests and of the covariance matrix estimators. The presentation concludes with a brief description of qreg2, which is a wrapper for qreg that implements all the methods discussed in the presentation.
Philippe Van Kerm
Luxembourg Institute of Socio-Economic Research (LISER)
This presentation illustrates three practical uses of influence functions (IF) in Stata. First (and most obviously), inspection of IFs helps detect influential sample observations. I show how this can be done in practice and how similar this is to examining jackknife replicates. Second, IFs make it easy to calculate (asymptotic) standard errors and confidence intervals for a wide range of statistics. I illustrate how this can be done in Stata with the total command so as to account for complex survey design easily. Third and finally, application of ‘recentered influence function (RIF) regression’ has recently been advocated to approximate the impact of covariates on (unconditional) distribution statistics. I demonstrate this use of IFs in Stata and discuss interpretation of RIF regression model coefficients. Empirical applications are to income distribution analysis. Several user-written utilities and commands are illustrated along the way.
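The second use mentioned in this abstract, standard errors from influence functions, can be sketched in Python for two simple statistics. The code assumes independent, equally weighted observations, so it does not reproduce the complex-survey calculation with Stata's total command that the talk covers; the function names and the example incomes are hypothetical.

```python
import math

def influence_mean(y):
    """Influence function values of the sample mean."""
    m = sum(y) / len(y)
    return [v - m for v in y]

def influence_variance(y):
    """Influence function values of the plug-in variance."""
    n = len(y)
    m = sum(y) / n
    s2 = sum((v - m) ** 2 for v in y) / n
    return [(v - m) ** 2 - s2 for v in y]

def if_standard_error(influence_values):
    """Asymptotic standard error: sqrt(sample variance of the IF values / n)."""
    n = len(influence_values)
    mean_if = sum(influence_values) / n
    var_if = sum((v - mean_if) ** 2 for v in influence_values) / (n - 1)
    return math.sqrt(var_if / n)

# Hypothetical incomes with one unusually large value
incomes = [18, 21, 25, 30, 34, 40, 46, 55, 70, 210]
print(round(if_standard_error(influence_mean(incomes)), 3))
print(round(if_standard_error(influence_variance(incomes)), 3))
# The observation with the largest absolute IF value is the most influential one
print(max(zip(map(abs, influence_variance(incomes)), incomes))[1])
```

For the mean, this reproduces the familiar standard deviation divided by the square root of n; the same recipe applies to any statistic whose influence function can be evaluated for each observation.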
Université Libre de Bruxelles, Belgium and Université de Namur, Belgium
Co-author: Brian O’Rourke
Final economic outcomes are often determined over consecutive process stages. The most prevalent approach is to model inter-nodal transition/event probabilities using techniques such as sequential logit. Transition success for survivors at each stage is then regressed on explanatory variables using standard logit (allowing for correlation in the error terms). This seemingly unrelated approach benefits from methodological convenience. It crucially depends, however, on the assumption that, at each stage, any unobservable factors are independent. We believe that error term independence may often be an excessively strong assumption. We propose an alternative approach based on multinomial probit that does not rely on that very restrictive assumption. Implementation is no more demanding. We describe the procedure using Stata 13. To illustrate the usefulness of the method, we estimate the determinants of success for each stage at the Rugby World Cup.
Department of Electronic Engineering, Technical University of Madrid
William Gould and colleagues
StataCorp, College Station, TX
StataCorp representatives will be given the floor, aiming to report on recent developments at StataCorp, and to discuss wishes, grumbles, and suggestions for further development with users.