Stata Tips #19 - Multilevel Tobit regression models in Stata 15
Multilevel Tobit regression models in Stata 15
Tobit models are made for censored dependent variables, where the value is sometimes only known within a certain range. Chemical sensors may have a lower limit of detection, for example. In this blog post, we'll use some simulated data so that we know what relationships we expect to see, and they will be censored with an upper limit, or as the jargon goes, right-censored.
Tobit models have been available in Stata for a while, but version 15 now includes multilevel versions with random intercepts and random slopes. Let's look at a simple mathematical representation.
In these equations, i indexes the individual cases and j the clusters of cases that define the multilevel structure. That might be exam results clustered by student, or trees clustered by plantation, or whatever. One neat way of thinking about the advantage of multilevel models is that they split up the unexplained variance into observation-level and cluster-level variance, and in doing so, give deeper understanding and more nuanced conclusions about the data.
x is an independent variable or predictor, y is the dependent variable or outcome of interest, but we don't observe y directly, only y*, which is the censored version at a threshold h. The betas are simple regression coefficients. Then, there is also a random intercept u, which is different for every cluster in the data's multilevel structure, and a random slope v, which contributes to the effect of x on y. The u and v values are normally distributed around zero, by convention.
Let's create some simulated data for an imaginary research study into rehabilitation for amputees learning to use electrically-controlled prosthetic hands. The data are fake but inspired by real-life research.
In these data, there are two training programmes being compared (variable: train) and the participants (identified by the variable: id) are tracked over several weeks (variable: week). They have to complete a test by carrying out four tasks of increasing difficulty (variable: task), and they are timed on the tasks (variable: time), so smaller time is desirable, but there is a maximum time of 30 seconds, so if they have not completed it within 30 seconds, then time is recorded as 30.
This is right-censoring of the dependent variable. We also have to account for the repeated-measures (economists may say "panel data") nature of the data in a multilevel model. There will be a random intercept by id, and a random slope for week, and the study is principally interested in the interaction between week and train, because this would indicate a difference between training programmes in their impact on the rate of recovery (slope relating week to time).
The code to generate these data is in the accompanying do-file. The graphs below show the individual "patients" as lines over time, split by the four tasks and two training groups.
If only we knew the real values of the dependent variable...
...then we could just fit a familiar multilevel linear regression. Let's call it "Model 1".
mixed y c.week##i.train i.task || id: week, covariance(unstructured)
You can see that this recovers the population parameters we fed into the simulated data, plus-minus some sampling error. Along with the other two models we'll fit, they are summarised in this table:
Next, we try the same linear regression, having censored our dependent variable above 30 seconds. The only thing we change is to replace y with ystar, which has any values above 30 replaced with exactly 30. This is "Model 2", and it effectively ignores the censoring, so we shouldn't be too surprised to get biased results out. In the table above, the highlighted cells show those parameters whose Model 1 values lie outside the 95% confidence intervals from Model 2.
mixed ystar c.week##i.train i.task || id: week, covariance(unstructured)
Many of these biases can be understood in terms of the data being pushed down to the red dashed line in the graphs above, and the linear model trying to compensate for that. But, when we use
metobit (which has almost identical syntax to
mixed) to fit "Model 3", we are asking Stata to integrate out the censored values over the part of the data distribution above the censoring threshold. That requires some assumptions in the model, and of course, this linear and normally distributed dataset fits those assumptions nicely, but in any situation where you are happy to use linear regression with the usual assumption of normal and homoscedastic residuals, it will be justified. Here, the biases of Model 2 all disappear, as seen in the table.
metobit ystar c.week##i.train i.task || id: week, ul(30) covariance(unstructured)
We can use
marginsplot to help us interpret the model. You might notice that
margins takes much longer with
metobit than with
mixed: we are asking a much harder task of the computer!
From this graph, it's obvious that the bias in Model 2 has been very well removed by the multilevel Tobit in Model 3.