Stata

Stata Features: Export tables to Excel

putexcel, a new feature in Stata 13 designed by Kevin Crow, allows you to easily export matrices, expressions and stored results to an Excel file. Combining putexcel with a Stata command's stored results lets you recreate the table displayed in your Stata Results window in an Excel file, as shown below.

A stored result is simply a scalar, macro or matrix stored in memory after you run a Stata command. The two main types of stored results are e-class (for estimation commands) and r-class (for general commands). You can list a command's stored results after it has been run by typing ereturn list (for estimation commands) or return list (for general commands). Here's a simple example: load the auto dataset and run correlate on the variables foreign and mpg:

. sysuse auto
(1978 Automobile Data)

. correlate foreign mpg
(obs=74)

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

Because correlate is not an estimation command, use the return list command to see its stored results.

. return list

scalars:
                  r(N) =  74
                r(rho) =  .3933974152205484

matrices:
                  r(C) :  2 x 2

Now you can use putexcel to export these results to Excel. The basic syntax of putexcel is:

putexcel excel_cell=(expression) … using filename [, options]

If you are working with matrices, the syntax is:

putexcel excel_cell=matrix(expression) … using filename [, options]

It's easy to build the above syntax in the putexcel dialog (there is also a helpful YouTube tutorial about the dialog). Listing the matrix r(C) shows the following:

. matrix list r(C)

symmetric r(C)[2,2]
           foreign        mpg
foreign          1
    mpg  .39339742          1

To re-create the table in Excel, you need to export the matrix r(C) with the matrix row and column names. The command to type in your Stata Command window is:

putexcel A1=matrix(r(C), names) using corr

Note that to export the matrix row and column names, we used the names option after specifying the matrix r(C). When the corr.xlsx file is opened in Excel, the table below is displayed:

[Image: the exported correlation table in corr.xlsx]

Next let’s try a more involved example. Load the auto dataset, and run a tabulation on the variable foreign. Because tabulate is not an estimation command, use the return list command to see its stored results.

. sysuse auto
(1978 Automobile Data)

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. return list

scalars:
                  r(N) =  74
                  r(r) =  2

tabulate differs from most Stata commands in that it does not automatically store all the results we need (we will use only the scalar r(N) from its stored results). The matcell() and matrow() options of tabulate save the results produced by the command into two Stata matrices.

. tabulate foreign, matcell(freq) matrow(names)

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

. matrix list freq

freq[2,1]
    c1
r1  52
r2  22

. matrix list names

names[2,1]
    c1
r1   0
r2   1

The putexcel commands used to create a basic tabulation table starting in cell A1 of an Excel file are:

putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
putexcel A2=matrix(names) B2=matrix(freq) C2=matrix(freq/r(N)) using results, modify

Below is the table produced in Excel using these commands:

[Image: the exported tabulation table in results.xlsx]

Again, this is a basic tabulation table. You probably noticed that the Cum. column and the Total row are not displayed in the exported table. Also, the Car type column contains the numeric values (0, 1) rather than the value labels (Domestic, Foreign) of the variable foreign, and the Percent column is not formatted correctly. Getting the exact table displayed in the Results window into an Excel file takes a little programming. With a few functions and a forvalues loop, you can easily export any table produced by running tabulate on a numeric variable.

There are two extended macro functions, label and display, that can help with this. The label function extracts the value labels of a variable, and the display function formats numbers for the numeric columns. Finally, you can use forvalues to loop over the rows of the returned matrices and write out the final table, as sketched below.
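A minimal sketch of how these pieces might fit together is shown below (the results filename, cell layout and display format are illustrative; r(N) is copied into a local right after tabulate so that later commands cannot overwrite it):

sysuse auto
tabulate foreign, matcell(freq) matrow(names)
local N = r(N)                                 // total number of observations
local rows = rowsof(names)                     // number of categories of foreign
putexcel A1=("Car type") B1=("Freq.") C1=("Percent") using results, replace
forvalues i = 1/`rows' {
    local row = `i' + 1                        // row 1 holds the column headers
    local val = names[`i',1]
    local val_lab : label (foreign) `val'      // extended macro function: value label
    local freq_val = freq[`i',1]
    local pct_val : display %9.2f `freq_val'/`N'*100   // extended macro function: format
    putexcel A`row'=("`val_lab'") B`row'=(`freq_val') C`row'=(`pct_val') using results, modify
}
local row = `rows' + 2
putexcel A`row'=("Total") B`row'=(`N') C`row'=(100.00) using results, modify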




 


19th London Stata Users Group Meeting: Abstracts


StataCorp presentations

Yulia Marchenko, StataCorp LP
ymarchenko@stata.com

Power of the power command

Stata 13's new power command performs power and sample-size analysis. The power command expands the statistical methods that were previously available in Stata's sampsi command. I will demonstrate the power command and its additional features, including the support of multiple study scenarios and automatic and customizable tables and graphs. I will also present new functionality allowing users to add their own methods to the power command.
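As a small illustration of the syntax (the means, standard deviation and target powers below are invented), a single call can report the required sample sizes for a two-sample comparison of means over several powers and graph the results:

* illustrative values only
power twomeans 12 15, sd(4) power(0.8 0.9) graph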

 


Long talks

Arne Risa Hole, University of Sheffield
a.r.hole@sheffield.ac.uk

Mixed logit modelling in Stata - an overview

The “workhorse” model for analysing discrete choice data, the conditional logit model, can be implemented in Stata using the official clogit and asclogit commands. While widely used, this model has several well-known limitations, which have led researchers in various disciplines to consider more flexible alternatives.

The mixed logit model extends the standard conditional logit model by allowing one or more of the parameters in the model to be randomly distributed. When modelling the choices of individuals, as is common in several disciplines including economics, marketing and transport, this allows for preference heterogeneity among respondents. Other advantages of the mixed logit model include the ability to allow for correlations across observations in cases where an individual made more than one choice, and relaxing the restrictive Independence from Irrelevant Alternatives property of the conditional logit model.

There is a range of commands that can be used to estimate mixed logit models in Stata. With the exception of xtmelogit, the official Stata command for estimating binary mixed logit models, all of them are user-written. The best-known module is probably gllamm, but while very flexible it can be slow when the model includes several random parameters. This talk will survey alternative commands for estimating mixed logit models, focusing on the mixlogit module. We will also look at alternatives and extensions to mixlogit, including the recent lclogit, bayesmlogit and gmnl commands. The talk will review the theory behind the methods implemented by these commands and present examples of their use.
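For orientation, a hedged sketch of the two model families in Stata syntax (the variable names are hypothetical: choice marks the chosen alternative, choiceset identifies choice situations and person identifies respondents):

clogit choice price time, group(choiceset)
* ssc install mixlogit
mixlogit choice, group(choiceset) id(person) rand(price time) nrep(500)

Here mixlogit treats the price and time coefficients as (by default normally distributed) random parameters, with id() linking the repeated choices of each respondent and nrep() setting the number of Halton draws used in maximum simulated likelihood estimation.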


Vincenzo Verardi, Free University of Brussels
vverardi@ulb.ac.be

Semiparametric regression in Stata

Semiparametric regression deals with the introduction of some very general non-linear functional forms in regression analyses. This class of regression models is generally used to fit a parametric model in which the functional form of a subset of the explanatory variables is not known and/or in which the distribution of the error term cannot be assumed to be of a specific type beforehand. To fix ideas, consider the partial linear model y = zb + f(x) + e, in which the shape of the potentially non-linear function of the predictor x is of particular interest. Two approaches to modelling f(x) are to use splines or fractional polynomials. This talk reviews other, more general approaches and the commands available in Stata to fit such models.

The main topic of the talk will be partial linear regression models, with some brief discussion also of so-called single index and generalized additive models. Though several semiparametric regression methods have been proposed and developed in the literature, these are probably the most popular ones.

The general idea of partial linear regression models is that a dependent variable is regressed on i) a set of explanatory variables entering the model linearly and ii) a set of variables entering the model non-linearly but without assuming any specific functional form. Several estimators have been proposed in the literature and are available in Stata. For example, the semipar command makes available what is called the double residuals estimator introduced by Robinson (1988) which is consistent and efficient. Similarly, the plreg command fits an alternative difference-based estimator proposed by Yatchew (1998) that has similar statistical properties to Robinson’s estimator. These estimators will be briefly compared to identify some drawbacks and pitfalls of both methods.

A natural concern of researchers is how these estimators could be modified to deal with heteroskedasticity, serial correlation and/or endogeneity in cross-sectional data, or how they could be adapted in the context of panel data to control for unobserved heterogeneity. As a consequence, a substantial part of the talk will be devoted to i) briefly explaining how the plreg and semipar commands can be used to tackle these very common violations of the Gauss-Markov assumptions in cross-sectional data and ii) showing how the user-written xtsemipar command makes a semiparametric regression easy to fit in the context of panel data.

Since it is sometimes possible to move towards purely parametric models, a test proposed by Härdle and Mammen (1993), designed to check whether the nonparametric fit can be satisfactorily approximated by a parametric polynomial adjustment of order p, will be described.

 


Other talks (alphabetical order of author)

Sara Ayllón, University of Girona
sara.ayllon@udg.edu

From Stata to aML

This presentation explains how to exploit Stata to run multilevel multiprocess regressions with aML (software downloadable for free from http://www.applied-ml.com/). I show how a single do-file can prepare the dataset, write the control files, input the starting values and run the regressions without the need to manually open aML's Command Prompt window. In this sense, Stata helps to avoid the difficulties of running complicated regressions with aML by automatically generating the necessary files, which avoids typing errors and easily allows changes in model specification. The paper contains an example of how well Stata and aML work together.


Jonathan Bartlett, London School of Hygiene and Tropical Medicine
jonathan.bartlett@lshtm.ac.uk

Multiple imputation of covariates in the presence of interactions and non-linearities

Multiple imputation (MI) is a popular approach to handling missing data, and an extensive range of MI commands is now available in official Stata. A common problem is that of missing values in covariates of regression models. When the substantive model for the outcome contains non-linear covariate effects or interactions, correctly specifying an imputation model for covariates becomes problematic. We present simulation results illustrating the biases which can occur when standard imputation models are used to impute covariates in linear regression models with a quadratic effect or interaction effect. We then describe a modification of the full conditional specification (FCS) or chained equations approach to MI, which ensures that covariates are imputed from a model which is compatible with a user-specified substantive model. We present the smcfcs Stata command which implements substantive model compatible FCS, and illustrate its application to a dataset.


Kit Baum, Boston College
baum@bc.edu

[joint with Mark Schaffer]

A general approach to testing for autocorrelation

Testing for the presence of autocorrelation in a time series, either in the univariate setting or using the residuals from the estimation of some model, is one of the most common tasks researchers face in the time-series setting. The standard Q test statistic is that introduced by Box and Pierce (1970), subsequently refined by Ljung and Box (1978). The original L-B-P test is applicable to univariate time series and to testing for residual autocorrelation under the assumption of strict exogeneity. Breusch (1978) and Godfrey (1978) in effect extended the L-B-P approach to testing for autocorrelations in residuals in models with weakly exogenous regressors. Both the L-B-P test and the Breusch-Godfrey test are available in Stata, the former for univariate time series via the wntestq command and the latter for postestimation testing following OLS using estat bgodfrey.
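For reference, the two official commands take roughly this form (variable names are illustrative and the data are assumed to be tsset):

tsset time
wntestq y, lags(12)            // Ljung-Box portmanteau (Q) test on a univariate series
regress y x1 x2
estat bgodfrey, lags(1/4)      // Breusch-Godfrey test on the OLS residuals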

All the above tests have important limitations: (a) the tests are for autocorrelation up to order p, where under the null hypothesis the series or residuals are i.i.d.; (b) when applied to residuals from single-equation estimation, the regressors must all be at least weakly exogenous; (c) the tests are for single-equation models, and do not cover panel data.

We use the results of Cumby and Huizinga (1992) to extend the implementation of the Q test statistic of L-B-P-B-G to cover a much wider range of hypotheses and settings: (a) tests for the presence of autocorrelation of order p through q, where under the null hypothesis there may be autocorrelation of order p-1 or less; (b) tests following estimation in which regressors are endogenous and estimation is by IV or GMM methods; (c) tests following estimation using panel data. For (c) we show that the Cumby-Huizinga test, developed for the large-T setting, is, when applied to the large-N panel data setting and limited to testing for 2nd order serial correlation, formally identical to the test presented by Arellano and Bond (1991) and available in Stata via Roodman's abar command.


Giovanni Cerulli, Institute for Economic Research on Firms and Growth, Rome
g.cerulli@ceris.cnr.it

treatrew: a user-written Stata routine for estimating Average Treatment Effects by reweighting on propensity score

Reweighting is a popular statistical technique for dealing with inference in the presence of a non-random sample. In the literature, various reweighting estimators have been proposed. This paper presents the author-written Stata routine treatrew, which implements the reweighting on propensity score estimator proposed by Rosenbaum and Rubin (1983) in their seminal article, where parameters' standard errors can be obtained either analytically (Wooldridge, 2010, pp. 920-930) or via bootstrapping. Since an implementation of this estimator with analytic standard errors was still missing in Stata, this paper and the ado-file and help file accompanying it aim to fill this gap by providing the community with an easy-to-use implementation of the reweighting on propensity score method, as a valuable tool for estimating treatment effects under “selection-on-observables” (or “overt bias”). Finally, a Monte Carlo experiment to check the reliability of treatrew and to compare its results with other treatment-effect estimators will also be provided.


Nicholas J. Cox, Durham University
n.j.cox@durham.ac.uk

Strategy and tactics for graphic multiples in Stata

Many, perhaps most, useful graphs compare two or more sets of values. Examples are two or more groups or variables (as distributions, time series, etc.) or observed and fitted values for one or more model fits. Often there can be a fine line in such comparisons between richly detailed graphics and busy, unintelligible graphics that lead nowhere. In this presentation I survey strategy and tactics for developing good graphic multiples in Stata.

Details include the use of over() and by() options and graph combine; the relative merits of super(im)posing and juxtaposing; backdrops of context; killing the key or losing the legend if you can; transforming scales for easier comparison; annotations and self-explanatory markers; linear reference patterns; plotting both data and summaries; plotting different versions or reductions of the data.

Datasets visited or revisited include James Short's collation of observations from the transit of Venus; John Snow's data on mortality in relation to water supply in London; Florence Nightingale's data on deaths in the Crimea; deaths from the Titanic sinking; admissions to Berkeley; hostility in response to insult; and advances and retreats of East Antarctic Ice Sheet glaciers.

Specific programs discussed include graph dot; graph bar; sparkline (SSC); qplot (SJ) and its relatives; devnplot (SSC); stripplot (SSC); tabplot (SJ/SSC).


Michael J. Crowther, Department of Health Sciences, University of Leicester
michael.crowther@le.ac.uk

Multilevel mixed effects parametric survival analysis

Multilevel mixed effects survival models are used in the analysis of clustered survival data, such as repeated events, multi-centre clinical trials or individual patient data meta-analyses, to investigate heterogeneity in baseline risk and treatment effects. I present the stmixed command for the parametric analysis of clustered survival data with two levels. Mixed effects parametric survival models available include the exponential, Weibull and Gompertz proportional hazards models, the Royston-Parmar flexible parametric model, and the log-logistic, log-normal and generalised gamma accelerated failure time models. Estimation is conducted using maximum likelihood, with both adaptive and non-adaptive Gauss-Hermite quadrature available. I will illustrate the command through simulation and application to clinical datasets.


David Fisher, MRC Clinical Trials Unit Hub for Trials Methodology Research
d.fisher@ctu.mrc.ac.uk

Two-stage individual participant data meta-analysis and forest plots

Stata has a wide range of tools for performing meta-analysis, but presently none for individual participant data (IPD) meta-analysis, in which the analysis units are within-study observations (e.g. patients) rather than aggregate study results.

I present ipdmetan, a command which facilitates two-stage IPD meta-analysis by fitting a specified model to the data of each study in turn and storing the results in a matrix. Features include subgroups, inclusion of aggregate (e.g. published) data, iterative estimates and confidence limits for the tau-squared measure of heterogeneity, and the analysis of treatment-covariate interactions. The latter is a great benefit of IPD collection, and is a subject on which my colleagues and I have published previously (Fisher et al., 2011, Journal of Clinical Epidemiology 64: 949-967). I shall discuss how ipdmetan facilitates our recommended approach, and its strengths and weaknesses, in particular one-stage vs. two-stage modelling and within- and between-trial information.

In addition, the graphics subroutine written for the metan package (Harris et al., 2008, Stata Journal 8: 3-28) has been greatly expanded to enable flexible, generalised forest plots for a variety of settings.  I shall demonstrate some of the possibilities and encourage feedback on how this may be developed further.

Examples will be given using real-world IPD meta-analyses of survival data in cancer, although the programs are applicable generally.


Piers Gaunt, Cancer Research UK Clinical Trials Unit, University of Birmingham
GauntP@bham.ac.uk

[Joint with Michael Crowther and Lucinda Billingham]

The stiqsp command: a non-parametric approach for the simultaneous analysis of quality of life and survival data using the integrated quality survival product

In medical research, particularly in the field of cancer, it is often important to evaluate the impact of treatments and other factors on a composite outcome based on survival and quality of life data, such as a Quality Adjusted Life Year (QALY). We present a Stata program, stiqsp, which determines the mean QALY using the Integrated Quality Survival Product. In this non-parametric approach the survival function is estimated using the Kaplan-Meier method and the quality of life function is derived from the mean quality of life score at the unique death times. Confidence intervals for the QALY score are determined using the bootstrap method. We illustrate the features of the command with a large dataset of patients with lung cancer.


Robert Grant, St George's, University of London & Kingston University
robert.grant@sgul.kingston.ac.uk

A multiple imputation and coarse data approach to residually confounded regression models

Residual confounding is a major problem in analysis of observational data, occurring when a confounding variable is measured coarsely (censored, heaped, missing, etc.) and hence cannot be fully adjusted to obtain a causal estimate by the usual means such as multiple regression. The analysis of coarse data has been investigated by Heitjan and Rubin but methods for coarse covariates are lacking.

A fully conditional specification multiple imputation approach is possible if we are able to model: (1) the confounding variable conditional on other information in the dataset, and (2) the coarsening mechanism. This provides a very flexible framework for removing residual confounding under our assumptions, including sensitivity analysis. An added complexity over missing data is that it may not be known which observations are coarsened.

The programming of this method in Stata is presented for various combinations of (1) and (2) above, using the ml and mi suites of commands. In the simplest case of a normally distributed confounder subject to known interval-censoring, intreg and mi can be applied. The method is illustrated with simulated data, and the true causal effect is recovered in each instance.


Richard Hooper, Barts & The London School of Medicine & Dentistry, QMUL
r.l.hooper@qmul.ac.uk

Sample size by simulation for clinical trials with survival outcomes: the simsam package in action

The simsam package, released earlier this year, allows versatile sample size calculation (calculating sample size required to achieve given statistical power) using simulation, for any method of analysis under any statistical model that can be programmed in Stata. Simulation is particularly helpful in situations where formulae for sample size are approximate or unavailable. The usefulness of the simsam package will depend in part, however, on the quality and variety of simsam applications developed by Stata users.

I will discuss simsam applications for clinical trials with time-to-event or survival outcomes. Here sample size formulae are the subject of ongoing research (available methods for cluster-randomised trials with variable cluster size, for example, are only approximate). Using such examples I will illustrate general simsam programming considerations such as dealing with analyses which fail, and the advantages of modular programming. Simulation forces us to think carefully about trial definitions - whether recruitment ends after a fixed time or a fixed number of recruits, for example - and to relate the sample size calculation to a detailed or contingent analysis plan. Such careful attention to detail may be lost in a formulaic approach to sample size calculation. Simulation also allows us to check for bias in a test as well as power.  But simulation is not without problems: while a rare (say one in a million) failure of an analysis in Stata may not worry the pragmatic statistician, a simsam application must anticipate this failure or risk stumbling on it in the course of many simulations.


Gordon Hughes, University of Edinburgh
G.A.Hughes@ed.ac.uk

Estimating spatial panel models using unbalanced panels

Econometricians have begun to devote more attention to spatial interactions when carrying out applied econometric studies. In part, this is motivated by an explicit focus on spatial interactions in policy formulation or market behaviour, but it may also reflect concern about the role of omitted variables which are or may be spatially correlated.

The Stata user-written procedure xsmle has been designed to estimate a wide range of spatial panel models including spatial autocorrelation, spatial Durbin and spatial error models using maximum likelihood methods. It relies upon the availability of balanced panel data with no missing observations. This requirement is stringent but it is a consequence of the fact that, in principle, the values of the dependent variable for any panel unit may depend upon the values of the dependent and independent variables for all of the other panel units. As a consequence, even a single missing data point may require that all data for a time period, panel unit or variable have to be discarded.

The presence of missing data is an endemic problem for many types of applied work, often because of the creation or disappearance of panel units. At the macro level, the number and composition of countries in Europe or local government units in the United Kingdom have changed substantially over the last three decades. In longitudinal household surveys new households are created and old ones disappear all of the time. Restricting the analysis to a subset of panel units that have remained stable over time is a form of sample selection whose consequences are uncertain and which may have statistical implications that merit additional investigation.

The simplest mechanisms by which missing data may arise underpin the missing at random (MAR) assumption. When this is appropriate, it is possible to use two approaches to estimation with missing data. The first is either simple or, preferably, multiple imputation, which involves the replacement of missing data by stochastic imputed values. The Stata procedure mi can be combined with xsmle to implement a variety of estimates that rely upon multiple imputation. While the combination of procedures is relatively simple to estimate, practical experience suggests that the results can be quite sensitive to the specification that is adopted for the imputation phase of the analysis. Hence, this is not a one-size-fits-all method of dealing with unbalanced panels, as the analyst must give serious consideration to the way in which imputed values are generated.

The second approach has been developed by Pfaffermayr. It relies upon the spatial interactions in the model which mean that the influence of the missing observations can be inferred from the values taken by non-missing observations. In effect, the missing observations are treated as latent variables whose distribution can be derived from the values of the non-missing data. This leads to a likelihood function which can be partitioned between missing and non-missing data and, thus, used to estimate the coefficients of the full model. The merit of the approach is that it takes explicit account of the spatial structure of the model. However, the procedure becomes computationally demanding if the proportion of missing observations is too large and, as one would expect, the information provided by the spatial interactions may not be sufficient to generate well-defined estimates of the structural coefficients. The missing at random assumption is crucial for both of these approaches, but it is not reasonable to rely upon it when dealing with the birth or death of distinct panel units.

A third approach, which is based on methods used in the literature on statistical signal processing, relies upon reducing the spatial interactions to immediate neighbours.  Intuitively, the basic unit for the analysis becomes a block consisting of a central unit (the dependent variable) and its neighbours (the spatial interactions). Since spatial interactions are restricted to within-block effects, the population of blocks can vary over time and standard non-spatial panel methods can be applied.

The presentation will describe and compare the three approaches to estimating spatial panel models as implemented in Stata as extensions to xsmle.  It will be illustrated by analyses of (a) state data on electricity consumption in the US, and (b) gridded historical data on temperature and precipitation to identify the effects of El Niño (ENSO) and other major weather oscillations.


Adam Jacobs, Dianthus
ajacobs@dianthus.co.uk

TBC


Stephen P. Jenkins, London School of Economics
s.jenkins@lse.ac.uk

A Monte-Carlo analysis of multilevel binary logit model estimator performance

Social scientists are increasingly fitting multilevel models to datasets in which a large number of individuals (N ~ several thousands) is nested within each of a small number of countries (C ~ 25). The researchers are particularly interested in ‘country effects’, as summarised by either the coefficients on country-level predictors (or cross-level interactions) or the variance of the country-level random effect. Although questions have been raised about the potentially poor performance of estimators of these ‘country effects’ when C is ‘small’, this issue appears not to be widely appreciated by many social science researchers. Using Monte-Carlo analysis, I examine the performance of two estimators of a two-level model for a binary dependent variable, using a design in which C = 5(5)50 and 100, with N = 1000 for each country. The results point to (a) the superior performance of adaptive quadrature estimators compared to PQL2 estimators, and (b) poor coverage of estimates of ‘country effects’ in models in which C ~ 25, regardless of estimator. The analysis makes extensive use of xtmelogit and simulate, and user-written commands such as runmlwin, parmby, and eclplot. Issues associated with having extremely long runtimes are also discussed.
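A stripped-down sketch of how such an exercise can be wired together with simulate and xtmelogit is given below (the data-generating values, the fixed C = 25 and the number of replications are invented for illustration and are not the design described above):

capture program drop onerep
program define onerep, rclass
    // one replication: C countries, N individuals per country, random-intercept logit
    drop _all
    local C = 25
    local N = 1000
    set obs `C'
    generate country = _n
    generate u = rnormal(0, 0.5)                    // country-level random effect
    expand `N'
    generate x = rnormal()                          // individual-level predictor
    generate y = runiform() < invlogit(-1 + 0.5*x + u)
    xtmelogit y x || country:, intpoints(7)         // adaptive quadrature fit
    return scalar b_x = _b[x]
end

simulate b_x=r(b_x), reps(100) seed(12345): onerep
summarize b_x                                       // compare the mean estimate with the true value 0.5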


Paul Lambert, University of Leicester
pl4@leicester.ac.uk

Estimating and modelling cumulative incidence functions using time-dependent weights

Competing risks occur in survival analysis when a subject is at risk of more than one type of event. A classic example is when there is consideration of different causes of death. Interest may lie in the cause-specific hazard rates which can be estimated using standard survival techniques by censoring competing events. An alternative measure is the cumulative incidence function (CIF) which gives an estimate of absolute or crude risk of death accounting for the possibility that individuals may die of other causes. Geskus (Cause-specific cumulative incidence estimation and the Fine and Gray model under both left truncation and right censoring. Biometrics, 2011, 67(1): 39-49) has recently proposed an alternative way for the estimation and modelling of the CIF that can use weighted versions of standard estimators.

I will describe a Stata command, stcrprep, that restructures survival data and calculates weights based on the censoring distribution. The command is based on the R command crprep, but I will describe a number of extensions that enable the CIF to be modelled directly using parametric models on large datasets. After restructuring the data and incorporating the weights, sts graph can be used to plot the CIF and stcox can be used to fit a Fine and Gray model for the CIF. An advantage of fitting models in this way is that it is possible to utilize a number of the standard features of the Cox model, for example using Schoenfeld residuals to visualize and test the proportional sub-hazards assumption.

I will also describe some additional options that are useful for fitting parametric models and for use with large datasets. In particular, I will describe how the flexible parametric survival models estimated using stpm2 can be used to directly model the cumulative incidence function. An important advantage is that all the predictions built into stpm2 can be used to directly predict the CIF, subdistribution hazards, etc.


Adrian Mander, MRC Unit, Cambridge
adrian.mander@mrc-bsu.cam.ac.uk

[Joint with Simon Bond]

Adaptive dose-finding designs to identify multiple doses that achieve multiple response targets

Within drug development it is crucial to find the right dose that is going to be safe and efficacious; this is often done within early phase II clinical trials. The aim of the dose-finding trial is to understand the relationship between the dose of the drug and its potential effect. Increasingly, adaptive designs are being used in this area because they allow greater flexibility for dose exploration compared to traditional fixed-dose designs.

An adaptive dose-finding design usually assumes a true non-linear dose-response model and selects doses that either maximise the determinant of the information matrix of the design (D-optimality) or minimise the variance of the predicted dose that gives a targeted response. Our design extends the predicted-dose methodology, in a limited number of patients (40), to finding two targeted doses: a minimally effective dose and a therapeutic dose. In our trial doses are given intravenously, so theoretically doses are continuous, and the response is assumed to be a normally distributed continuous outcome.

Our design has an initial learning phase where pairs of patients are assigned to five pre-assigned doses. The next phase is fully sequential, with an interim analysis after each patient to determine the choice of dose based on the optimality criterion of minimising the determinant of the covariance of the estimated target doses. The dose-choice algorithm assumes that a specific parametric dose-response model is the true relationship, and so a variety of models are considered at the interim analysis and human judgement is involved in the overall decision.

I will introduce a Mata command that uses the optimize() function to find the estimated parameters of the model and to subsequently find the optimal design. Simulated results show that assuming a model with a small number of parameters (≤3) leads to a choice of doses that are not near the target doses and over-relies on interpolation under the modelling assumptions. Fitting more flexible models with more parameters (≥4) results in a choice of doses near to the two target doses. Overall the design is efficient and seamlessly combines the initial learning and subsequent confirmatory stages.


Roger B. Newson, National Heart and Lung Institute, Imperial College London.
r.newson@imperial.ac.uk

Creating factor variables in resultssets and other datasets

Factor variables are defined as categorical variables with integer values, which may represent values of some other kind, specified by a value label. We frequently want to generate such variables in Stata datasets, especially resultssets, which are output Stata datasets produced by Stata programs, such as the official Stata statsby command and the SSC packages parmest and xcontract. This is because categorical string variables can only be plotted after conversion to numeric variables, and because these numeric variables are also frequently used in defining a key of variables, which identify observations in the resultsset uniquely in a sensible sort order. The sencode package is downloadable, and frequently downloaded, from SSC, and is a "super" version of encode, which inputs a string variable and outputs a numeric factor variable. Its added features include a replace option allowing the output numeric variable to replace the input string variable, a gsort() option allowing the numeric values to be ordered in ways other than alphabetical order of the input string values, and a manyto1 option allowing multiple output numeric values to map to the same input string value. The sencode package is well-established, and has existed since 2001. However, some tips will be given on ways of using it that are not immediately obvious, but which the author has found very useful over the years when mass-producing resultssets. These applications use sencode with other commands, such as the official Stata command split and the SSC packages factmerg, factext and fvregen.
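As a hedged illustration of the basic pattern (the variable names below are hypothetical, and the variable supplied to gsort() is simply whatever defines the desired ordering of the factor levels):

* ssc install sencode
* replace the string variable parm with a numeric factor variable whose value
* labels are the original strings, ordered by estimate rather than alphabetically
sencode parm, replace gsort(estimate)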


Irene Petersen and Cathy Welch, UCL Department of Primary Care and Population Health
i.petersen@ucl.ac.uk
catherine.welch@ucl.ac.uk

Multiple imputation of missing data in longitudinal health records

Electronic health records are increasingly used for epidemiological and health service research. However, missing data is often an issue when dealing with this type of data. Up to now, various approaches have been used to overcome these issues, including complete case analysis, last observation carried forward and multiple imputation. In this presentation we will first highlight the issues of missing data in longitudinal records and provide examples of the limitations of standard methods of multiple imputation. We will then demonstrate the new user-written Stata command twofold, which implements the two-fold fully conditional specification (FCS) multiple imputation algorithm (Nevalainen, Kenward, and Virtanen, Missing values in longitudinal dietary data: A multiple imputation approach based on a fully conditional specification, Stat Med., 2009, 28(29): 3657-3669).

In the application of the two-fold FCS algorithm we divide time into equal size time blocks. The algorithm then imputes missing values in the longitudinal data, imputing one time block, and then the next. The defining characteristic is that when imputing missing values at a particular time block, only measurements at that time block and adjacent time blocks are used. This obviates some of the principal difficulties which are typically encountered when attempting to apply a standard MI approach to imputing such longitudinal data.

We illustrate how the two-fold FCS MI algorithm works in practice and maximises the use of the available data, even in situations where measurements are made on only a relatively small proportion of individuals in each time block. We discuss some of the strengths and limitations of the two-fold FCS MI algorithm, and contrast it with existing approaches to imputing longitudinal data. Lastly, we present results demonstrating the potential for efficiency gain through use of the two-fold approach compared to a more conventional baseline MI approach.


Sergiy Radyakin, The World Bank
sradyakin@worldbank.org

usecspro: importing CSPro hierarchical datasets to Stata

Hierarchical datasets are commonly a product of the CSPro system developed by the US Census Bureau. CSPro has become widely popular and a de facto standard for data collection in many countries; some agencies supply data exclusively in CSPro format. While CSPro can export the data into Stata format on its own, the procedure compromises on some features and requires the user to run CSPro, which operates in MS Windows only.

The new Stata module usecspro allows easy import of hierarchical datasets into Stata by automatically parsing the CSPro dictionary files. The conversion procedure is implemented in Mata and allows importing data from any level and any record of the hierarchical CSPro dataset. It also preserves the variable and value labels and takes care of missing values and other common concerns during the data conversion.


Philippe Van Kerm, CEPS/INSTEAD, Luxembourg
philippe.vankerm@ceps.lu

Repeated half-sample bootstrap resampling

This presentation illustrates a Stata implementation of the repeated half-sample bootstrap proposed by Saigo et al. (Survey Methodology, 2001). This resampling scheme is easy to implement and is appropriate for complex survey designs, even with small stratum sizes. The user-written command rhsbsample mimics the official bsample command and can be used for bootstrap inference in a wide range of settings.


Ian White, MRC Biostatistics Unit, Cambridge
ian.white@mrc-bsu.cam.ac.uk

A suite of Stata programs for network meta-analysis

Network meta-analysis involves synthesising the scientific literature  comparing several treatments. Typically, two-arm and three-arm randomised trials are synthesised, and the aim is to compare treatments that have not been directly compared, and often to rank the treatments. A difficulty is that the network may be inconsistent, and ways to assess this are required.

In the past, network meta-analysis models have been fitted using Bayesian methods, typically in WinBUGS. I have recently shown how they may be expressed as multivariate meta-analysis models and hence fitted using mvmeta. However, various challenges remain, including getting the data set up in the correct format; parameterising the inconsistency model; and making good graphical displays of complex data and results. I will show how a new suite of Stata programs, network, meets these challenges, and I will illustrate its use with examples.





 
