10th London Stata Users Group Meeting

Monday 28th June 0930
1830 Dinner (Optional)
Tuesday 29th June 0930

Synopsis of Papers

Analysing linked employer-employee data with Stata
Richard Upward, School of Economics, University of Nottingham

The use of datasets which contain information on both workers and the firms they work for is growing rapidly, especially in fields such as applied econometrics and labour economics. Similar data structures may also arise in the analysis of data on patients and doctors, or students and schools. Many of these datasets are extremely large, some containing a substantial fraction of the population of firms and workers. The analysis of this kind of data poses two related problems. The first is a problem of computing power, memory and storage. The second is the statistical problem of how to control for and estimate the "unobserved effects" (also known as "fixed effects") for both workers and firms. In this presentation we explain the basic issues and how we have dealt with them using Stata. We illustrate using both simulated data and a large linked employer-employee panel collected by the Institut für Arbeitsmarkt und Berufsforschung in Germany. We show how to implement various potential methods, and suggest problems and limitations which the analyst using Stata may encounter.

Abowd, J. and Kramarz, F. 1999. The analysis of labor markets using matched employer-employee data. In Ashenfelter, O. and Card, D. (eds.) Handbook of Labor Economics Volume 3. North-Holland.

Approximating the bias of the LSDV estimator for dynamic panel data models
Giovanni S.F. Bruno, Università Commerciale Luigi Bocconi, Milano

It is well known that the LSDV estimator for dynamic panel data models is not consistent for N large and finite T. Nickell (1981) derives an expression for the inconsistency for N → +∞, which is O(1/T).
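The O(1/T) inconsistency can be seen in a small simulation. The sketch below is written in Python rather than Stata and is not the code described in this abstract. The AR(1) design with unit effects, the parameter values and the function names (`lsdv_ar1_estimate`, `mean_lsdv_bias`) are all assumptions chosen for illustration.

```python
import random


def lsdv_ar1_estimate(n_units, n_periods, gamma, rng, burn_in=50):
    """Within (LSDV) estimate of gamma in y_it = gamma*y_i,t-1 + eta_i + eps_it."""
    num = 0.0
    den = 0.0
    for _ in range(n_units):
        eta = rng.gauss(0.0, 1.0)
        y = eta / (1.0 - gamma)  # start at the unit-specific long-run mean
        for _ in range(burn_in):  # discard early periods to reach stationarity
            y = gamma * y + eta + rng.gauss(0.0, 1.0)
        series = [y]
        for _ in range(n_periods):
            y = gamma * y + eta + rng.gauss(0.0, 1.0)
            series.append(y)
        lagged = series[:-1]
        current = series[1:]
        mean_lag = sum(lagged) / n_periods
        mean_cur = sum(current) / n_periods
        # Demeaning within each unit is equivalent to including unit dummies.
        for x, z in zip(lagged, current):
            num += (x - mean_lag) * (z - mean_cur)
            den += (x - mean_lag) ** 2
    return num / den


def mean_lsdv_bias(n_units, n_periods, gamma, replications, seed=12345):
    """Average of (estimate - gamma) over Monte Carlo replications."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        total += lsdv_ar1_estimate(n_units, n_periods, gamma, rng) - gamma
    return total / replications


if __name__ == "__main__":
    for t in (5, 10, 20):
        print(t, round(mean_lsdv_bias(100, t, 0.5, 50), 3))
```

Under these assumptions the average bias is negative and shrinks in absolute value as the number of time periods grows, while increasing the number of units alone does not remove it.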
Kiviet (1995) uses asymptotic expansion techniques to approximate the small sample bias of the LSDV estimator to also include terms of at most order 1/NT, thus offering a method to correct the LSDV estimator for samples where N is small or only moderately large. In Kiviet (1999) and Bun and Kiviet (2003) the bias expression is more accurate, including higher order terms. Monte Carlo evidence in Judson and Owen (1999) strongly supports the corrected LSDV estimator compared to more traditional GMM estimators when N is only moderately large. Bruno (2004) extends the bias approximation formulas in Bun and Kiviet (2003) to accommodate unbalanced panels with a strictly exogenous selection rule. This paper describes the Stata code used in Bruno (2004) to compute the bias approximations and carry out the Monte Carlo experiment estimating the actual LSDV bias for various data generating processes. The analysis covers both balanced and unbalanced panels. It is found that the actual bias as estimated by Monte Carlo replications, besides following the same patterns as in Bun and Kiviet (2003), turns out to be non-increasing in the degree of unbalancedness. Moreover, the approximations are always accurate, with a decreasing contribution to the actual bias of the higher order terms.

Bruno, G.S.F. 2004. Approximating the bias of the LSDV estimator for unbalanced dynamic panel data models. mimeo.
Bun, M.J.G. and Kiviet, J.F. 2003. On the diminishing returns of higher order terms in asymptotic expansions of bias. Economics Letters 79: 145-152.
Judson, R.A. and Owen, A.L. 1999. Estimating dynamic panel data models: a guide for macroeconomists. Economics Letters 65: 9-15.
Kiviet, J.F. 1995. On bias, inconsistency and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68: 53-78.
Kiviet, J.F. 1999. Expectation of expansions for estimators in a dynamic panel data model: some results for weakly exogenous regressors. In Hsiao, C., Lahiri, K., Lee, L-F, and Pesaran, M.H. (eds.) Analysis of Panel Data and Limited Dependent Variables. Cambridge: Cambridge University Press.
Nickell, S.J. 1981. Biases in dynamic models with fixed effects. Econometrica 49: 1417-1426.

Multiple imputation of missing data: an implementation of van Buuren's MICE, and more
Patrick Royston, MRC Clinical Trials Unit, London

Following the seminal publications of Rubin starting about 30 years ago, statisticians have become increasingly aware of the inadequacy of 'complete case' analysis of datasets with missing observations. In medicine, for example, observations may be missing in a sporadic way for different covariates, and a complete-case analysis may omit as many as half of the available cases. 'Hotdeck' imputation was implemented in Stata by Mander and Clayton (1999). However, this technique may perform poorly in the common case when many rows of data have at least one missing value. In this talk, I will describe an implementation for Stata of the 'MICE' method of multiple multivariate imputation described by van Buuren et al. (1999) (see also www.multiple-imputation.com). MICE stands for Multivariate Imputation by Chained Equations.

The basic idea of data analysis with multiple imputation is to create a small number (e.g. 3-5) of copies of the data, each of which has the missing values suitably imputed. Then, each complete dataset is analysed independently. Estimates of parameters of interest are averaged across the copies to give a single estimate. Standard errors are computed according to the 'Rubin rules' (Rubin 1987), devised to allow for the between- and within-imputation components of variation in the parameter estimates.

In the talk, I will present briefly five ado-files. mvis creates multiple multivariate imputations. uvis imputes missing values for a single variable as a function of several covariates, each with complete data.
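The combination step under the Rubin rules mentioned above can be written out in a few lines. This is a minimal Python sketch of the arithmetic only, not the code of any of the ado-files named in this abstract. The function name `rubin_combine` is hypothetical, and the inputs are assumed to be one point estimate and one squared standard error per imputed dataset.

```python
import math


def rubin_combine(estimates, variances):
    """Combine per-imputation estimates and variances (squared standard errors).

    Returns the pooled estimate and its standard error, using
    total variance = within + (1 + 1/m) * between.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = sum(estimates) / m  # average of the m estimates
    within = sum(variances) / m  # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)


if __name__ == "__main__":
    # Three imputed datasets, each giving an estimate and its variance.
    print(rubin_combine([1.0, 2.0, 3.0], [0.5, 0.5, 0.5]))
```

The between-imputation term is what a single imputation would ignore, which is why standard errors from one filled-in dataset tend to be too small.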
micombine fits a wide variety of regression models to a multiply imputed dataset, combining the estimates using Rubin's rules. micombine supports survival analysis models (stcox and streg), categorical data models, generalised linear models, and more. Finally, misplit and mijoin are utilities to inter-convert datasets created by mvis and by the miset routine of Carlin et al. (2003). The use of the routines will be illustrated by example.

Carlin, J.B., Li, N., Greenwood, P., Coffey, C. 2003. Tools for analyzing multiple imputed datasets. Stata Journal 3: 226-244.
Mander, A. and Clayton, D. 1999. Hotdeck imputation. Stata Technical Bulletin 51: 32-34.
Rubin, D.B. 1987. Multiple Imputation for Non-response in Surveys. New York: John Wiley.
van Buuren, S., Boshuizen, H.C., Knook, D.L. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18: 681-694.

Smooth hazard functions for survival time data
Margaret May, Department of Social Medicine, University of Bristol

In medical prognosis based on survival analysis, there is an interest in visualizing the shape of the hazard function. In fully parametric models, the shape of the hazard function is constrained by the properties of the chosen distribution (Weibull, log-logistic, lognormal, Gompertz, gamma). The semi-parametric Cox model only assumes proportional hazards and has no specification of the baseline hazard. In Stata 8 a method of illustrating hazard functions using a kernel smooth of the hazard contributions is implemented for the Cox model, which allows more flexible shapes. However, if the proportional hazards assumption is violated, then a method based on smoothing the Nelson-Aalen cumulative hazard function, followed by numerical differentiation to give the hazard function and then further kernel density smoothing of the resulting function, may be useful.
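The three steps just described (smooth the Nelson-Aalen cumulative hazard, differentiate numerically, smooth again) can be sketched as follows. This is an illustrative Python sketch, not the author's Stata implementation. The Gaussian kernel smoother, the evaluation grid, the bandwidth and all function names are assumptions.

```python
import math


def nelson_aalen(times, events):
    """Distinct event times and the Nelson-Aalen cumulative hazard at each."""
    order = sorted(range(len(times)), key=lambda i: times[i])
    at_risk = len(times)
    out_t, out_h = [], []
    cum = 0.0
    i = 0
    while i < len(order):
        t = times[order[i]]
        deaths = 0
        leaving = 0
        while i < len(order) and times[order[i]] == t:
            deaths += events[order[i]]  # events are coded 1, censorings 0
            leaving += 1
            i += 1
        if deaths > 0:
            cum += deaths / at_risk
            out_t.append(t)
            out_h.append(cum)
        at_risk -= leaving
    return out_t, out_h


def step_value(ts, hs, t):
    """Value at time t of the step function defined by (ts, hs)."""
    value = 0.0
    for tj, hj in zip(ts, hs):
        if tj > t:
            break
        value = hj
    return value


def kernel_smooth(xs, ys, bandwidth):
    """Gaussian kernel-weighted average of ys, evaluated at each point of xs."""
    smoothed = []
    for x0 in xs:
        weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
        smoothed.append(sum(w * y for w, y in zip(weights, ys)) / sum(weights))
    return smoothed


def smooth_hazard(times, events, grid, bandwidth):
    """Smoothed hazard on grid: smooth H(t), differentiate, smooth again."""
    ts, hs = nelson_aalen(times, events)
    cum = [step_value(ts, hs, g) for g in grid]
    cum_smooth = kernel_smooth(grid, cum, bandwidth)
    n = len(grid)
    deriv = []
    for k in range(n):
        lo, hi = max(k - 1, 0), min(k + 1, n - 1)  # one-sided at the ends
        deriv.append((cum_smooth[hi] - cum_smooth[lo]) / (grid[hi] - grid[lo]))
    return kernel_smooth(grid, deriv, bandwidth)
```

Calling `smooth_hazard` separately for each risk factor group gives one curve per group without assuming that the hazards are proportional. Estimates near the ends of the grid are less reliable because fewer points contribute to the smooth there.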
This method will be illustrated using data from ART-CC, an international collaboration of 12 cohorts with data on over 19,000 HIV positive patients. The hazard of AIDS or death by risk factor groups defined by initial CD4 count (a measure of immune system functioning) or injection drug use (IDU) is modelled from the time of starting antiretroviral therapy for up to 5 years.

A new Stata command for estimating confidence intervals for the variance component of random-effects linear models
Matteo Bottai and Nicola Orsini

The Stata command xtreg estimates the random-effects linear regression model, for which the random effects are assumed to be normally distributed with zero mean and non-negative variance, σ_u². Testing homogeneity across units is equivalent to testing the null hypothesis H0: σ_u² = 0, which is a value on the boundary of the parameter space. The command xtreg provides the upper-tail probability of the appropriate asymptotic distribution of the likelihood ratio test statistic. However, such a method cannot be used to construct confidence intervals for the parameter σ_u². Besides, confidence intervals for the random-effect variance that are based on a Wald-type test, too often used, can be shown to be asymptotically wrong. Based on the asymptotic theory for singular information problems, a method is developed and implemented in the Stata command xtci, which provides asymptotically correct confidence intervals. Also, when testing the hypothesis of homogeneity across units, the proposed method is shown to have better small-sample properties than one based on the likelihood ratio test statistic.

Bottai, M. and Orsini, N. 2004. Confidence intervals for the variance component of random-effects linear models. Stata Journal 4: in press.
Bottai, M. 2003. Confidence regions when the Fisher information is zero. Biometrika 90(1): 73-84.

A comment on infrequency of purchase models in Stata
Julian A. Fennema, Centre for Economic Reform and Transformation, Heriot-Watt University

This paper introduces the dhurdle command for Stata, a maximum-likelihood routine (d2) to estimate the Cragg double hurdle model with either independent or dependent errors. We give a brief description of the procedure and its application to durable goods consumption and market participation models. We briefly demonstrate the construction of the program, and present evidence of its consistency. We compare its efficiency to the results reported by Flood and Gråsjö for the routine programmed in Gauss, and repeat their tests of the effect of misspecification on the parameter estimates. We also outline extensions in the pipeline, particularly the inverse hyperbolic sine heteroskedasticity correction, but also invite suggestions.

Flood, L. and Gråsjö, U. 2001. A Monte Carlo Simulation of Tobit models. Applied Economics Letters 8: 581-584.

Topics in time series regression modeling
Christopher F. Baum, Department of Economics, Boston College, Boston, MA

This talk will discuss the use of a number of Stata commands, some "official" and some user-contributed, in the context of working with time-series and panel data. Testing for endogeneity/exogeneity of regressors, heteroskedasticity in an instrumental variables context, and fitting regression models with ARMA errors will be considered, as well as a number of tests for stationarity of single or multiple time series, including stationarity in the presence of structural breaks.

Stata graphics, under the hood
Vince Wiggins, StataCorp, College Station, TX

Stata's graphics are more flexible than many realise. We will exploit this flexibility and explore a potpourri of topics, some of interest to all graphers and others primarily of interest to those creating highly customised graphs or new graph commands.
Among the topics will be creating custom schemes to control the appearance of graphs, including an overview of the format and contents of scheme files. We will also examine twoway graphs as a platform for creating custom graphs, some of which are not readily apparent. We will discuss techniques for managing data and leveraging twoway's native plottypes. Along the way we will introduce some new official and unofficial tools, and perhaps some downright dangerous, but useful, undocumented tricks.

Separation brings analysts and their graphs together
Matthew Barnes, Office for National Statistics, London

The recent addition of support for the CMYK colour model to Stata 8 allows graphs from Stata to be used where colour separation is required for printing. This paper outlines work being done at the Office for National Statistics to use Stata for graphics in our flagship "Economic Trends" publication. This work aims to reduce the burden on our design teams, allow later deadlines and ensure that analysts have more control over the appearance of their graphics in the final publication.

Circular statistics in Stata, revisited
Nicholas J. Cox, Department of Geography, University of Durham

Circular data are a large class of directional data, which are of interest to scientists in many fields, including biologists (movements of migrating animals), meteorologists (winds), geologists (directions of joints and faults), and geomorphologists (landforms, oriented stones). These examples are all recordable as compass bearings relative to North. Other examples include phenomena that are periodic in time, including those dependent on time of day (in biomedical statistics: hospital visits or times of birth) or time of year (in applied economics: unemployment or sales variations). The analysis of circular data is an odd corner of statistical science that many never visit, even though it has a long and curious history.
Moreover, it seems that no major statistical language provides direct support for circular statistics. This talk describes the development and use of some routines that have been written in Stata, primarily to allow graphical and exploratory analyses. In 2004, such routines are being rewritten, especially to allow use of the new graphics of Stata 8.

Ulrich Kohler, WZB, Berlin

Biplots display correlations and differences in means and standard deviations of many variables on one graph, together with the values of the plotted variables and approximations of the Euclidean distance between the observations. Biplots are useful for identifying clusters of observations, guiding interpretation of factor analyses, detecting multivariate outliers and getting an idea about the correlation structure of the data. The talk will demonstrate the merits of biplots and discuss the development of a new version of biplot for Stata 8.2.

Tabulation of multiple responses
Ben Jann, Soziologie, ETH Zürich

Although multiple response questions are quite common in survey research, Stata's official release offers little support for the effective analysis of multiple response variables. For example, in a study on drug addiction an interview question might be: "Which substances did you consume during the last four weeks?" The respondents just list all the drugs they took, if any; e.g., an answer could be "cannabis, cocaine, heroin" or "ecstasy, cannabis" or "none", etc. Usually, the responses to such questions are held as a set of variables and, therefore, cannot be easily tabulated. I will address this issue and present a new module to compute one- and two-way tables of multiple responses. The module supports several types of data structure, provides significance tests and offers various options to control the computation and display of the results.

Controlling for time-dependent confounding using marginal structural models
Zoe Fewell, M. A. Hernán, F. Wolfe, K. Tilling, H. Choi and J. A. C. Sterne

Longitudinal studies in which exposures, confounders and outcomes are measured repeatedly over time have the potential to allow causal inferences about the effects of exposure on outcome. There is particular interest in estimating the causal effects of medical treatments (or other interventions) in circumstances in which a randomised controlled trial is difficult or impossible. However, standard methods for estimating exposure effects in longitudinal studies are biased in the presence of time-dependent confounders affected by prior treatment. This talk describes the use of marginal structural models (described by Robins et al.) to estimate exposure or treatment effects in the presence of time-dependent confounders affected by prior treatment. The method is based on deriving inverse-probability-of-treatment weights, which are then used in a pooled logistic regression model to estimate the causal effect of treatment on outcome. We demonstrate the use of marginal structural models to estimate the effect of methotrexate on mortality in persons suffering from rheumatoid arthritis.

Meta-analysis in Stata: history, progress and prospects
Jonathan Sterne, Department of Social Medicine, University of Bristol

Systematic reviews of randomised trials are now widely recognised to be the best way to summarise the evidence on the effects of medical interventions. A systematic review may (though it need not) contain a meta-analysis, 'a statistical analysis which combines the results of several independent studies considered by the analyst to be "combinable"'. The first researcher to do a meta-analysis was probably Karl Pearson, in 1904. Sadly, Stata was not available at this time. The first Stata command for meta-analysis - the meta command - was published in the Stata Technical Bulletin in 1997, and exploited a facility, introduced in Stata version 5, to program graphics.
It requires the user to derive an estimate of the effect of intervention, together with its standard error, for each study. The metan command, published in 1998, does analyses based on the 2×2 table for each study, and provides more detailed graphical displays. Facilities for cumulative meta-analysis and meta-regression, and tools for examining bias in meta-analysis, have since been introduced. It is perhaps surprising that Stata commands for meta-analysis are still entirely user-written. This means that the existing commands that produce graphics (a major advantage of the Stata commands compared with those available in other statistical packages) are outdated since the introduction of Stata 8 graphics. Possible ways forward will be discussed, and the talk will conclude with a discussion of developments in meta-analysis that could usefully be addressed by future Stata commands.

Compliance-adjusted intervention effects in survival data
Lois G. Kim and Ian R. White, MRC Biostatistics Unit, Cambridge

Time-to-event endpoints are a common outcome of interest in randomised clinical trials. The primary analysis should usually be by intention-to-treat, giving an indication of the effectiveness of the intervention in a population as a whole. However, the benefit specifically for an individual receiving the intervention is becoming increasingly important as patient decisions become more evidence-based. Effectiveness is defined as the benefit of intervention as actually applied, and may be estimated from simple all-or-nothing compliance data. Efficacy, on the other hand, is the benefit of intervention under ideal circumstances, and requires more complex compliance data. Intervention effectiveness and efficacy after accounting for non-compliance can be estimated in various ways, some of which have already been implemented in Stata (e.g. strbee).
Recently, Loeys and Goetghebeur (2003) provided new methodology using proportional-hazards techniques in survival data where compliance is all-or-nothing in the intervention arm and perfect in the control arm. Here, their method is implemented in Stata. The output is a hazard ratio for the effectiveness of intervention, adjusted for observed adherence to intervention in the treated group. An example application is discussed for a subset of a large, randomised trial of screening where the average benefit of 26% risk reduction becomes a 34% risk reduction for individuals attending screening.

Loeys, T. and Goetghebeur, E. 2003. A causal proportional hazards estimator for the effect of treatment actually received in a randomized trial with all-or-nothing compliance. Biometrics 59: 100-105.

From datasets to resultssets in Stata
Roger Newson, Department of Public Health Sciences, King's College, London

A resultsset is a Stata dataset created as output by a Stata program. It can be used as input to other Stata programs, which may in turn output the results as publication-ready plots or tables. Programs that create resultssets include xcontract, xcollapse, parmest, parmby and descsave. Stata resultssets do a similar job to SAS output data sets, which are saved to disk files. However, in Stata, the user typically has the options of saving a resultsset to a disk file, writing it to the memory (overwriting any pre-existing data set), or simply listing it. Resultssets are often saved to temporary files, using the tempfile command. This lecture introduces programs that create resultssets, and also programs that do things with resultssets after they have been created. listtex outputs resultssets to tables that can be inserted into a Microsoft Word, HTML or TeX document. eclplot inputs resultssets and creates confidence interval plots. Other programs, such as sencode and tostring, process resultssets after they are created and before they are listed, tabulated or plotted.
These programs, used together, have a power not always appreciated if the user simply reads the on-line help for each package.

Applying the Cox proportional hazards regression model to competing risks
Abdel G. A. Babiker, MRC Clinical Trials Unit, London

In the presence of dependent competing risks in survival analysis, the Cox proportional hazards model can be utilised to examine covariate effects on the cause-specific hazard function for each type of failure. The method proposed by Lunn and McNeil (1995) requires data augmentation. With k failure types, the data would be duplicated k times, one record for each failure type. Either a stratified or an unstratified analysis could be used, depending on whether the assumption of proportional hazards holds. If the proportional hazards assumption does not hold across the causes, the stratified analysis should be used, which is equivalent to fitting a separate model for each failure type. The unstratified analysis assumes a constant hazard ratio between failure types and this could be fitted by including an indicator variable as a covariate. We will show how both approaches could be fitted on augmented data using stcox. In addition to the parameter estimates and their standard errors, the program has an option to produce cumulative incidence functions with pointwise confidence limits.

Lunn, M. and McNeil, D. 1995. Applying Cox regression to competing risks. Biometrics 51: 524-532.

Genome-wide linkage scans and basic bioinformatics implemented using Stata/SE
Toby Andrew, Twin & Genetic Epidemiology Research Unit, Department of Medicine, St Thomas' Hospital

Searches for genes using linkage analyses with genetic markers placed across the entire human genome are hypothesis-free experiments, which represent an extreme form of multiple testing. As such, the low p-values required to obtain nominal significance make accurate diagnostics essential to assess model fit and to eliminate naive incorrect results.
In hypothesis-driven single tests, researchers usually take good care to assess model fit and the validity of model assumptions, but such concerns are usually ignored when it comes to linkage analysis. This is particularly problematic where low thresholds (p < 0.0001) can result in extreme sensitivity to outlying observations and, for some models (e.g. standard variance component analysis), greater sensitivity to violation of model assumptions.

Here we attempt to address these problems for genomic data based on 1300 healthy sib-pairs (dizygotic twins) using modified Haseman-Elston regression-based linkage analysis for quantitative traits, in which sib-pair phenotypic covariance is correlated with genetic marker covariance. The statistical theory underpinning the implementation of tests for linkage using generalized linear models (GLM) (glm in Stata) is documented in detail elsewhere. In brief, the advantage of analysing sib-pairs using GLM is that the approach shares all of the strengths of OLS and variance components, but none of their weaknesses. These are that (1) unlike OLS, the residual errors are correctly specified with a gamma distribution and known heteroscedasticity is accounted for; (2) unlike standard variance components, by freely estimating the coefficient of variation, GLM is robust to phenotypic deviations from multivariate normality.

Just as important are the practical advantages. With the release of Stata 8 Special Edition for large datasets, we have been able to store and check genetic markers for all 22 pairs of autosomal chromosomes plus sex chromosomes. In addition, we have generated 2-point and multipoint allele-sharing identical by descent (IBD) elsewhere and imported this into Stata. Using Stata scripts with a simple loop structure that calls on the glm command, we are able to perform genome-wide scans and save any summary statistics to file. We have been able to utilise the following features in Stata:
1. correct diagnostics on a genome-wide basis that are not normally made available to users of applied linkage packages

Finally, we also can perform basic, but powerful bioinformatics tasks such as:
1. using the xpose command to summarise marker information by chromosome and sib-pair

Barber, M.J., Cordell, H.J., MacGregor, A.J. and Andrew, T. 2004. Gamma regression improves Haseman-Elston and variance components linkage analysis for sib-pairs. Genetic Epidemiology 26(2): 97-107.

Evaluation of diagnostic tests for diseases in pregnancy: some statistical issues
Paul T. Seed, Dept of Obstetrics & Gynaecology, GKT School of Medicine, King's College, London

A diagnostic test is used typically because it is cheaper, quicker or less invasive than the reference standard, but may not be as reliable. Diagnostic tests are evaluated against a reference standard (sometimes called "Gold Standard"), regarded as completely accurate. Commands diagt and diagti have been developed to evaluate binary tests, and provide all the standard measures of performance (including sensitivity, specificity, likelihood ratios, and predictive values, with appropriate confidence intervals). A "prevalence" option adjusts for different case-mix, and evaluates the test result for a particular patient with known pre-test risk. The use of ROC curves for ordered categorical and continuous data will be considered, in particular the determining of a suitable cut-off value. Where the distribution of a continuous measure can be adequately modelled, the likelihood ratio can be used to determine the absolute risk of an individual patient. Appropriate Stata commands for these analyses will be demonstrated.

William W. Gould, StataCorp, College Station, TX

Bill Gould, who is President of StataCorp, and, more importantly for this meeting, the head of development, will ruminate about work at Stata over the last year and about ongoing activity.