
Stata Features: Export tables to Excel

putexcel, a new feature in Stata 13 designed by Kevin Crow, lets you easily export matrices, expressions, and stored results to an Excel file. Combining putexcel with a Stata command's stored results lets you re-create in an Excel file the table displayed in your Stata Results window, as follows.

A stored result is simply a scalar, macro, or matrix stored in memory after you run a Stata command. The two main types of stored results are e-class (for estimation commands) and r-class (for general commands). You can list a command's stored results after it has been run by typing ereturn list (for estimation commands) or return list (for general commands). Here's a simple example: load the auto dataset and run correlate on the variables foreign and mpg:

. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

Because correlate is not an estimation command, use the return list command to see its stored results.

. return list

scalars:
                  r(N) =  74
                r(rho) =  .3933974152205484

matrices:
                  r(C) :  2 x 2

Now you can use putexcel to export these results to Excel. The basic syntax of putexcel is:

putexcel excel_cell=(expression) … using filename [, options]

If you are working with matrices, the syntax is:

putexcel excel_cell=matrix(expression) … using filename [, options]
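For example, after the correlate run above, the stored scalars can be written straight to a workbook (a minimal sketch; corr_stats is an arbitrary filename):

putexcel A1=("N") B1=(r(N)) A2=("rho") B2=(r(rho)) using corr_stats, replace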

It's easy to build the above syntax in the putexcel dialog (a helpful YouTube tutorial about the dialog is also available). Listing the matrix r(C) shows the following:

. matrix list r(C)

symmetric r(C)[2,2]
           foreign        mpg
foreign          1
    mpg  .39339742          1

To re-create the table in Excel, you need to export the matrix r(C) with the matrix row and column names. The command to type in your Stata Command window is:

putexcel A1=matrix(r(C), names) using corr

Note that the names option, specified after the matrix r(C), exports the matrix row and column names. When the corr.xlsx file is opened in Excel, the table below is displayed:

[Figure: the correlation matrix r(C) as exported to corr.xlsx]

Next let’s try a more involved example. Load the auto dataset, and run a tabulation on the variable foreign. Because tabulate is not an estimation command, use the return list command to see its stored results.

. sysuse auto
(1978 Automobile Data)

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. return list

scalars:
                  r(N) =  74
                  r(r) =  2

tabulate differs from most Stata commands in that it does not automatically store all the results we need (only the scalar r(N) will be of use here). The matcell() and matrow() options of tabulate save the frequencies and category values produced by the command into two Stata matrices.

. tabulate foreign, matcell(freq) matrow(names)

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. matrix list freq

freq[2,1]
    c1
r1  52
r2  22

. matrix list names

names[2,1]
    c1
r1   0
r2   1

The putexcel commands used to create a basic tabulation table starting in cell A1 of an Excel file are:

putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
putexcel A2=matrix(names) B2=matrix(freq) C2=matrix(freq/r(N)) using results, modify

Below is the table produced in Excel using these commands:

[Figure: the basic tabulation table as exported to results.xlsx]

Again, this is a basic tabulation table: neither the Cum. column nor the Total row of the original table is included. Also, the Car type column contains the numeric values (0, 1) rather than the value labels (Domestic, Foreign) of the variable foreign, and the Percent column is not formatted correctly. Getting the exact table displayed in the Results window into an Excel file takes a little programming. With a few functions and a forvalues loop, you can easily export any table produced by running tabulate on a numeric variable.

Two extended macro functions can help with this: label, which extracts the value labels of a variable, and display, which correctly formats numbers for the numeric columns. Finally, you can use forvalues to loop over the rows of the saved matrices to produce the final table, as in the sketch below.
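A sketch along those lines, specialized to the tabulation above (the cell layout and the %9.2f format are illustrative choices):

sysuse auto, clear
tabulate foreign, matcell(freq) matrow(names)
local N = r(N)                                // save before it is overwritten
putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
local rows = rowsof(names)
forvalues i = 1/`rows' {
    local val = names[`i',1]                  // category value in row i
    local lab : label (foreign) `val'         // its value label
    local f = freq[`i',1]                     // frequency in row i
    local pct : display %9.2f `f'/`N'*100     // formatted percentage
    local row = `i' + 1
    putexcel A`row'=("`lab'") B`row'=(`f') C`row'=(`pct') using results, modify
}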




 


20th London Stata Users Group Meeting Proceedings

Date: 11-12 September 2014
Location: Cass Business School, City University London, UK

Overview

The 20th London Stata Users Group Meeting took place on 11-12 September 2014 at Cass Business School, London, UK.

The Stata Users Group Meeting is a two-day international conference where the use of Stata is discussed across a wide range of fields and environments. Established in 1995, the UK meeting is the longest-running series of Stata Users Group Meetings. The meeting is open to everyone. In past years, participants have travelled from around the world to attend the event. Representatives from StataCorp are also in attendance.

 


Proceedings

View the abstracts and download the presentations from the 20th London Stata Users Group Meeting below. A PDF version of the abstracts is also available for download.

Easy-to-use packages for estimating rank and spline parameters

Roger B. Newson
Department of Primary Care and Public Health, Imperial College London

So-called non-parametric methods are in fact based on estimating and testing parameters, usually either rank parameters or spline parameters. Two comprehensive packages for estimating these are somersd (for rank parameters) and bspline (for spline parameters). Both estimate a wide range of parameters, but both are frequently found difficult to use by casual users. This presentation introduces rcentile, an easy-to-use front end for somersd, and polyspline, an easy-to-use front end for bspline. rcentile estimates percentiles with confidence limits, optionally allowing for clustered sampling and sampling-probability weights. The confidence intervals are saved in a Stata matrix, with one row per percentile, which the user can save to a resultsset using the xsvmat package. polyspline inputs an X-variable and a user-defined list of reference points and outputs a basis of variables for a polynomial or for another unrestricted spline. This basis can be included in the covariate list for an estimation command, and the corresponding parameters will be values of the polynomial or spline at the reference points, or differences between these values. By default, the spline will simply be a polynomial, with a degree one less than the number of reference points. However, if the user specifies a lower degree, then the spline will have knots interpolated sensibly between the reference points.

Additional materials:

uk14_newson.pdf

Using Stata for sequence analysis

Brendan Halpin
University of Limerick

Sequence analysis (SA) is a very different way of looking at categorical longitudinal data, such as life-course or labor-market histories (or any ordered categorical data, for that matter). Instead of focusing on transition rates (for example, via hazard rate, Markov or panel models), it takes individual time series and compares them as wholes. It has significant advantages at a descriptive and exploratory level and can help detect patterns that conventional methods will overlook. As availability of longitudinal data increases, this becomes a significant advantage.

SA hinges on defining measures of similarity between sequences, typically to generate data-driven classifications, for example, by cluster analysis. Most SA uses the optimal matching distance, but other measures are in use. There is some controversy about the applicability of SA algorithms to social science data and about their parameterization. Comparison of different methods and parameterizations helps clarify the issues.

For a long time, TDA was the only package social scientists had access to for SA, but in recent years, both Stata and R have had relevant functionality, in Stata’s case provided by the sq and sadi packages.

In this talk, I will discuss the current state of the sadi package. sadi differs from sq in being based on a plugin; therefore, it is significantly faster: many of the distance measures are computationally intensive, and typically, N(N-1)/2 comparisons will be made for N observations. sadi also provides additional distance measures, including dynamic Hamming, time-warp edit distance, and a version of Elzinga’s number of matching subsequences measure. It includes tools for inspecting and graphing sequence data and for comparing distance measures and the resulting cluster analyses.

I will also briefly discuss the advantages and disadvantages of using plugins rather than Mata and make some remarks about cross-compiling plugins under Linux.

A new command for plotting regression coefficients and other estimates

Ben Jann
University of Bern

Graphical display of regression results has become increasingly popular in presentations and the scientific literature because, in many cases, graphs are much easier to read than tables. In Stata, such plots can be produced by the marginsplot command. However, while marginsplot is very versatile and flexible, it has two major limitations: it can process only results left behind by margins, and it can handle only one set of results at a time. In this presentation, I will introduce a new command called coefplot, which overcomes these limitations. It plots results from any estimation command and combines results from several models into a single graph. The default behavior of coefplot is to plot markers for coefficients and horizontal spikes for confidence intervals. However, coefplot can also produce various other types of graphs. The capabilities of coefplot are illustrated using a series of examples.
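A minimal usage sketch (assuming coefplot is installed, for example via ssc install coefplot; the models are arbitrary):

sysuse auto, clear
regress price mpg trunk
estimates store m1
regress price mpg trunk foreign
estimates store m2
coefplot m1 m2, drop(_cons) xline(0)

This overlays both models' coefficients in a single plot, with markers for the point estimates and spikes for the confidence intervals.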

Making interactive online visualizations with stata2d3 and stata2leaflet

Robert Grant
St George’s, University of London and Kingston University

The last three years have seen explosive growth in the variety and sophistication of interactive online graphics. These are mostly implemented in the web language JavaScript, with the Data-Driven Documents (D3) library being the most popular and flexible at present; Leaflet is a mapping library that is also widely used. R users have packages that translate their data and specifications into interactive graphics and maps by writing a text file containing the HTML and JavaScript instructions that make up a webpage with the desired visualization. This translation into a webpage is easily achieved in Stata too, and I will present the stata2leaflet command, which produces zoomable, clickable online maps.

Contemporary interactive graphs benefit from allowing the viewer to filter and select data of interest, which is a second layer of specification implemented in the stata2d3 commands. stata2d3 capitalizes on the consistency of Stata graph syntax by parsing and translating a standard Stata graph command into a webpage. Users can choose to include explanatory comments against each line in the source code, which are invisible to viewers but help them to learn HTML and JavaScript and make further refinements.

Additional materials:

uk14_grant.pdf

Small multiples, or the science and art of combining graphs

Nicholas J. Cox
Durham University

Good graphics often exploit one simple graphical design that is repeated for different parts of the data, which Edward R. Tufte dubbed the use of small multiples. In Stata, small multiples are supported for different subsets of the data with by() or over() options of many graph commands; users can easily emulate this in their own programs by writing wrapper programs that call twoway or graph bar and its siblings. Otherwise, specific machinery offers repetition of a design for different variables, such as the (arguably much under-used) graph matrix command. Users can always put together their own composite graphs by saving individual graphs and then combining them.

I reviewed small multiples at the 2013 London meeting and this presentation offers further modest automation of the same design repeated for different data. Four recent general programs allow small multiples in different ways. crossplot offers a simple student-friendly graph matrix for each y and each x variable specified, which is more general than a scatter plot matrix. combineplot is a command for combining univariate or bivariate plots for different variables. designplot is a command for showing univariate summaries that generalizes and rewrites grmeanby in graph dot style. subsetplot automates display of different subsets of data with the complement as backdrop. Recent additions to some earlier programs are also reviewed.

Additional materials:

uk14_cox.ppt

Simulating simple and complex survival data

Michael Crowther
Department of Health Sciences, University of Leicester

Simulation studies are conducted to assess novel and currently used methods in practice and to better understand the frameworks in question. In survival analysis, we are interested in simulating both an event and a censoring distribution, to better reflect clinical data. In this talk, I will describe how to simulate survival times from simple parametric distributions and then move to a more general framework, illustrating how to simulate from a general user-defined hazard function. This can incorporate any combination of a complex baseline hazard function with turning points, time-dependent effects, random effects, and nonlinear covariate effects. This is achieved through a two-stage algorithm incorporating numerical integration nested within root-finding techniques. The methods will be illustrated using the publicly available survsim package.
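For the simple parametric case, survival times can be drawn by inverting the survivor function; a sketch with illustrative parameter values (the survsim package automates this and the far more general cases described above):

clear
set obs 1000
set seed 2014
generate trt = rbinomial(1, 0.5)    // treatment arm indicator
generate u = runiform()
// Weibull hazard h(t|x) = lambda*gamma*t^(gamma-1)*exp(beta*trt);
// solving S(t|x) = u for t gives a draw from the survival distribution
local lambda = 0.1
local gamma  = 1.5
local beta   = -0.5
generate stime = (-ln(u)/(`lambda'*exp(`beta'*trt)))^(1/`gamma')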

Additional materials:

uk14_crowther.pdf

Transformation survival models

Yulia Marchenko
StataCorp, College Station, TX

The Cox proportional hazards model is one of the most popular methods to analyze survival or failure-time data. The key assumption underlying the Cox model is that of proportional hazards. This assumption may often be violated in practice. Transformation survival models extend the Cox regression methodology to allow for nonproportional hazards. They represent the class of semiparametric linear transformation models, which relates an unknown transformation of the survival time linearly to covariates. In my presentation, I will describe these models and demonstrate how to fit them in Stata.
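For orientation, a standard statement of this class (not necessarily the notation used in the talk): an unspecified increasing transformation H() of the survival time T is modeled as H(T) = -x'b + e, and the assumed distribution of the error e determines the model, with extreme-value errors yielding proportional hazards and logistic errors yielding proportional odds.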

Additional materials:

uk14_marchenko.pdf

Floating-point numbers: A visit to the looking glass world

William Gould
StataCorp, College Station, TX

In lieu of my usual Report to users, I will talk about floating-point numbers. Researchers do not adequately appreciate that floating-point numbers are a simulation of real numbers; as with all simulations, some features are preserved while others are not. When one writes code, or even do-files, treating the computer's floating-point numbers as if they were real numbers can lead to substantive problems and to numerical inaccuracy. In this, the relationship between computers and real numbers is not entirely unlike the relationship between tea and Douglas Adams's Nutri-Matic drink dispenser. The Nutri-Matic produces a concoction that is "almost, but not quite, entirely unlike tea". I show what the universe would be like if it were implemented in floating-point rather than in real numbers. The floating-point universe turns out to be nothing like the real universe and probably could not be made to function. Without jargon and without resort to binary, I show how floating-point numbers are implemented on an imaginary base-10 computer and quantify the kinds of errors that can arise. In this, floating-point subtraction stands out as really being almost, but not quite, entirely unlike subtraction. I show how to work around such problems. The point of the talk is to build your intuition about the floating-point world so that you as a researcher can predict when calculations might go awry, know how to think about the problem, and determine how to fix it.
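A one-line taste of the problem, relying only on ordinary double-precision arithmetic (the behavior is the same in any software using IEEE doubles):

. display 0.1 + 0.2 - 0.3
5.551e-17

None of 0.1, 0.2, or 0.3 has an exact binary representation, so the difference is tiny but not zero.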

Additional materials:

uk14_gould.pdf

A generalized boxplot for skewed and heavy-tailed distributions implemented in Stata

Vincenzo Verardi
Université Libre de Bruxelles

The boxplot is probably the most commonly used tool to represent the distribution of data and to identify atypical observations in a univariate dataset. The problem with the standard boxplot is that as soon as asymmetry or tail heaviness appears, the percentage of values identified as atypical becomes excessive. To cope with this issue, Hubert and Vandervieren (2008) proposed an adjusted boxplot for skewed data. Their idea is to move the whiskers of the boxplot according to the degree of asymmetry of the data. The rule for setting the whiskers of the adjusted boxplot was found by running a large number of simulations using a wide range of (moderately) skewed distributions, with the aim of finding a rule that guarantees that 0.7% of the observations lie outside the interval delimited by the whiskers. Even though their rule works satisfactorily for most commonly used distributions, it suffers from some limitations: (i) the adjusted boxplot is not appropriate for severely skewed distributions or for distributions with heavy tails; (ii) it is tied to a specific theoretical rejection rate of 0.7%; (iii) it is extremely sensitive to the estimated value of the asymmetry parameter; and (iv) it has a substantial computational complexity, O(n log n).

To tackle these drawbacks, we propose a much simpler method to find the whiskers of the boxplot for possibly skewed and heavy-tailed data. We apply a simple rank-preserving transformation to the original data so that the transformed data can be fitted by a so-called Tukey g-and-h distribution. Using the quantiles of this distribution, we can easily recover the whiskers of the boxplot on the scale of the original data. The computational complexity of the proposed method is O(n), the same as for the standard boxplot.
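For reference, the Tukey g-and-h transform maps a standard normal deviate z to T(z) = ((exp(g*z) - 1)/g) * exp(h*z^2/2). Once g and h have been estimated, whiskers follow from plugging normal quantiles into this formula; a sketch with illustrative parameter values (the estimation step is omitted):

local g = 0.3                     // skewness parameter (illustrative)
local h = 0.1                     // tail-heaviness parameter (illustrative)
local zlo = invnormal(0.0035)     // 0.35% in each tail = 0.7% overall
local zhi = invnormal(0.9965)
display "lower whisker: " ((exp(`g'*`zlo') - 1)/`g')*exp(`h'*`zlo'^2/2)
display "upper whisker: " ((exp(`g'*`zhi') - 1)/`g')*exp(`h'*`zhi'^2/2)

Because the transformation preserves ranks, whiskers found on the transformed scale map straight back to the original data.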

Reference:

Hubert, M., and E. Vandervieren. 2008. An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis 52: 5186-5201.

Additional materials:

uk14_verardi.pdf

Data visualization with Stata: How to adapt APIs and take advantage of HTML5 Canvas to create dynamic charts and maps with Stata

Thomas Roca
Agence Française de Développement, AFD

Data visualization is a burgeoning field at the crossroads of design, computer science, and statistics. Using HTML and applying data visualization techniques allows the creation of elegant and insightful representations with Stata. Nevertheless, creating an original "dataviz" from an empty page is demanding: it requires specific programming knowledge. Thankfully, many pieces of code have been developed under GNU/GPL or Apache licenses and are freely accessible (Google's APIs are the best-known examples).

However, these pieces of code need to be adapted; many options are available to modify the display (text, legends, labels, colours, size, type of representation, etc.). Besides, HTML5 now embeds a powerful graphics and drawing engine, Canvas, which can be used to represent any data from scratch. Because the underlying datasets need to be reshaped and organized, a fair amount of programming becomes necessary; Stata users thus possess a great asset for filling in and formatting HTML and JavaScript content.

Specific programs will be discussed:

  • Google Geomap package creates heat maps using Google’s API.
  • Donut chart package builds two-layer pie charts supported by Highchart’s API.

The project "How good are you converting your income into..." is a simple data visualization created with Stata using HTML5 Canvas, which easily allows flagging outliers. This dataviz will be presented as an introduction to the use of Canvas with Stata.

The Dynamic Scatter Plot, powered by HTML5 Canvas, can represent three variables and displays additional information on mouse-over. Furthermore, two Dynamic Scatter Plots can be superimposed to facilitate comparisons.

To illustrate the possibilities Stata offers for web programming, I will present the Country Dashboards project, which shows how to create a comprehensive web portal embedding hundreds of dataviz and webpages built with Stata.

Additional materials:

uk14_roca.pdf
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2302578

Using Mata to accelerate the familywise error rate calculation for multi-arm multi-stage clinical trial designs

Daniel Bratton
MRC Clinical Trials Unit at UCL, London, UK

More efficient trial designs are needed to combat the increasing cost of drug development. A class of trial designs that can help to accelerate this process, known as multi-arm multi-stage (MAMS) designs, has been proposed and used to much effect in oncology. To facilitate the design of these trials for time-to-event outcomes, the nstage program was introduced for Stata in 2009. This program provides an estimate of the sample size requirement together with estimates of the pairwise operating characteristics and stage durations.

An important quantity in a multi-arm trial is its familywise error rate (FWER), that is, the probability of incorrectly recommending at least one ineffective treatment at the end of a trial. Strong control of the FWER is often a requirement in any confirmatory multi-arm study. A subroutine was therefore recently added to nstage to calculate the FWER of a MAMS design by simulating trial-level data, specifically the z-test statistics for the treatment effect in each arm at each stage of the study.

The calculation is relatively quick, taking less than 10 seconds for designs with up to 6 arms and 5 stages. However, it can be made more efficient by performing the calculation in Mata. This involves three-dimensional matrices, which Mata does not support directly but which can be emulated using pointers. The speed advantage of the Mata calculation over the Stata-only version increases with the number of arms and stages, reaching almost a 50% reduction in computing time for a 6-arm 5-stage design. Although only a few seconds are saved for a single design, the cumulative savings are considerable when searching over multiple designs to find the most suitable or most efficient one for the MAMS trial in question.
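A minimal Mata sketch of the pointer idiom (dimensions are illustrative; this is not the nstage code):

mata:
// emulate a 3-D array (sims "sheets" of arms x stages matrices)
// with a vector of pointers
arms = 6
stages = 5
sims = 3
A = J(sims, 1, NULL)               // one pointer per sheet
for (k = 1; k <= sims; k++) {
    A[k] = &(J(arms, stages, 0))   // each slot points to an arms x stages matrix
}
(*A[2])[4, 3] = 1.96               // write to sheet 2, row 4, column 3
(*A[2])[4, 3]                      // read it back
end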

In this talk, I describe the calculations in Stata and Mata and compare their speeds for designs with various numbers of arms and stages. The use of pointers in the FWER calculation in Mata is described, and I discuss their advantages and potential use in other areas.

Additional materials:

uk14_bratton.pdf

Analysis of time-stratified case-crossover studies in environmental epidemiology using Stata

Aurelio Tobías
Spanish Council for Scientific Research, Barcelona, Spain

Ben Armstrong and Antonio Gasparrini
London School of Hygiene and Tropical Medicine

The time-stratified case-crossover design is widely used in environmental epidemiology to analyze the short-term effects of environmental risk factors, such as air pollution or temperature, on human health. It compares the exposure level on the day when the health event occurs (the case day) with the levels on control days chosen by alternative selection methods. Standard analytical approaches to case-crossover analysis, based on conditional logistic regression (the clogit command in Stata), require a considerable amount of data management. Here we introduce the gencco command to reshape datasets from time-series to time-stratified case-crossover designs. We then discuss alternative statistical models for case-crossover analysis of aggregated data using Poisson and overdispersed Poisson regression (poisson and glm) and conditional Poisson regression (xtpoisson). Furthermore, we introduce an updated command for conditional Poisson regression that allows for overdispersion (xtpoisson_addOD). Examples are given using air pollution and mortality datasets, although these methods are applicable more generally in other contexts.
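As an indication of the modeling step (a hedged sketch with hypothetical variable names, not the gencco workflow): with daily event counts and strata defined by, say, year, month, and day of week, the conditional Poisson model is a fixed-effects Poisson regression within strata:

egen stratum = group(year month dow)   // time-stratified strata
xtset stratum
xtpoisson deaths pm10 temperature, fe  // conditional (fixed-effects) Poisson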

Additional materials:

uk14_tobias.ppt

Handling treatment changes in randomized trials

Ian R. White
MRC Biostatistics Unit, Cambridge Institute of Public Health, Cambridge

Treatment changes in randomized trials are common: for example, in a trial evaluating psychotherapy, individuals allocated to psychotherapy may attend only partially or not at all; or in a trial evaluating a drug treatment, individuals allocated to no drug treatment may nevertheless receive the drug. The issue is especially important in drug trials for late-stage cancer, where control-group members typically receive the active treatment on disease progression. This talk focuses on time-to-event outcomes. In some cases, it is important to estimate the effect of treatment if some or all of these treatment changes had not occurred: for example, for a health economic model exploring whether a drug should be available on the NHS, we would need to compare the survival of a treated group with the survival of a completely untreated group.

Twelve years ago, I published strbee, which implements in Stata the rank-preserving structural failure time model (RPSFTM) of Robins and Tsiatis (1991). This is a model that infers a comparison of the randomized groups if they had had different treatment experiences; estimation is based only on comparisons of randomized groups.
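For orientation, the RPSFTM can be written (in standard notation, not necessarily that of the talk) as positing a counterfactual untreated event time U = T_off + exp(psi)*T_on, where T_on and T_off are the times an individual spends on and off treatment; psi is estimated as the value that balances U across the randomized groups, which is why estimation rests only on the randomization.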

Over the intervening years, the RPSFTM has been increasingly (though not widely) used, and various problems have been identified. First, it assumes treatment benefit is the same whenever treatment is received, and a sensitivity analysis to address possible departures from this assumption is needed. Second, the method’s power is low and declines as follow-up extends: later times that contribute little information are given the same weight as earlier times. Third, a wider range of estimation procedures is required.

I will review the RPSFTM method and its alternatives in a Stata context, and I will describe an update of strbee which addresses the above issues.

Reference:

Robins, J. M., and A. Tsiatis. 1991. Correcting for non-compliance in randomized trials using rank-preserving structural failure time models. Communications in Statistics - Theory and Methods 20: 2609-2631.

Additional materials:

uk14_white.pptx

Bayesian analysis with Stata: Application to neonatal mortality in the UK

John R. Thompson
Department of Health Sciences, University of Leicester

The Bayesian approach to statistical analysis has many theoretical advantages, but in the past, its use has been limited by a lack of suitable statistical software. In a book published this year by Stata Press called Bayesian Analysis with Stata, I have tried to show that Stata can be used for Bayesian as well as frequentist statistical analysis.

In this talk, I will present a Bayesian analysis of neonatal mortality rates in England and Wales and show how it can be implemented in Stata or Mata or by calling WinBUGS from within Stata.

Over the last few decades, neonatal mortality has been falling steadily throughout the world, but the UK lags some way behind many other developed countries. Each year, data on neonatal mortality categorized by birth weight and maternal age are published. We will use a Bayesian analysis of these data to see if the declining rate over time has been similar in all the categories.

Additional materials:

uk14_thompson.pdf

ntreatreg: a Stata module for estimation of treatment effects in the presence of neighborhood interactions

Giovanni Cerulli
CERIS (Institute for Economic Research on Firm and Growth), Roma

This paper presents a parametric counterfactual model identifying average treatment effects (ATEs) by conditional mean independence when externality (or neighborhood) effects are incorporated into the traditional Rubin potential-outcome model.

As such, it tries to generalize the usual control-function regression, widely used in program evaluation and epidemiology, when the stable unit treatment value assumption (SUTVA) is relaxed. As a by-product, the paper also presents ntreatreg, a user-written Stata routine for estimating ATEs when social interaction may be present. Finally, an instructional application of the model and its Stata implementation (using ntreatreg) is shown through two examples, the first on the effect of housing location on crime and the second on the effect of education on fertility, with results compared against a no-interaction setting.

Additional materials:

uk14_cerulli.pdf

Extending Stata’s capabilities for asymptotic covariance matrix estimation

Christopher F. Baum
Boston College and DIW Berlin

Mark E. Schaffer
Heriot-Watt University, CEPR and IZA

The avar routine (Baum and Schaffer, SSC) constructs the "filling" for a number of flavors of "sandwich" covariance matrix estimators, including HAC, one- and two-way clustering, common cross-panel autocorrelated errors, etc. We show how avar can be used as a building block to construct VCEs that go beyond the Eicker-Huber-White and one-way cluster-robust VCEs provided by Stata’s official _robust command. We also show how avar can be used to provide multiple-equation VCE estimates in a wider variety of circumstances than Stata’s official suest command.

Additional materials:

uk14_schaffer.pdf

Space-filling location selection

Philippe Van Kerm
CEPS/INSTEAD, Luxembourg

This presentation describes a Stata implementation of a space-filling location selection algorithm. The objective is to select a subset from a discrete list of locations so that the spatial coverage of the locations by the selected subset is optimized according to a geometric criterion. Such an algorithm designed for geographical site selection is useful more generally to determine a grid of points that "covers" a data matrix as needed in various nonparametric estimation procedures. Various examples illustrate usage of the user-written command spacefill.

Additional materials:

uk14_vankerm.pdf

Wishes and grumbles

William Gould and colleagues
StataCorp, College Station, TX

William Gould, president of StataCorp and, more importantly in this context, chief developer of Stata, and colleagues will be happy to receive wishes for developments in Stata and almost as happy to receive grumbles about the software.

 
