# 2017 Spanish Stata Users Group Meeting

### Now what do I do with this function?

Enrique Pinzón
StataCorp
Nonparametric analysis has traditionally been descriptive. We fit the regression function that relates the outcome of interest to the covariates, and then we graph it. But we can go beyond the descriptive. We may use this function to compute marginal effects, counterfactuals, and other statistics of interest. In other words, we may use -margins- after -npregress- to conduct semiparametric analysis. I will show you how.

### Latent class analysis and finite mixture models with Stata

Isabel Canette
StataCorp
Sometimes, we are interested in identifying and understanding different groups in a population, even though we cannot directly observe which group each individual belongs to. Latent class analysis deals with these problems. Often, those classes are determined by heterogeneity in regression models, where the relationship of a dependent variable (or variables) with a group of covariates varies from group to group. The new features added in Stata 15 to the gsem command allow us to fit a wide array of latent class models. Special cases of those are finite mixture models, which can also be fit using the new prefix fmm. We will introduce these topics and discuss examples using Stata.

### Generalized linear models (GLM) applied to the prediction of health expenditure

Vicente Caballer, David Vivas, N. Guadalajara, Alexander Zlotnik, Isabel Barrachina
Objectives: To implement a health expenditure prediction system based on morbidity and to analyze its goodness of fit. Methods: Observational, descriptive, retrospective and cross-sectional study of total health expenditure using explanatory, predictive, stratified models. The database covered 156,811 inhabitants of the Denia health department and included age, Clinical Risk Group (CRG) and total health expenditure, among other variables. GLMs with a gamma distribution and logarithmic link were fitted in several iterations, with total health expenditure as the dependent variable and age, gender and CRG membership as independent variables, in order to select the model that best explains the behavior of health expenditure. The model with the highest statistical significance was the one that combined the variables age, sex, CRG health status and severity level; its Akaike information criterion was 14.2. Correlating the values estimated by the model with the real values gave a correlation of 25%. By type of expenditure, CRG showed greater explanatory capacity for outpatient pharmaceutical spending and lower explanatory capacity for hospital expenditure. Conclusions: Multimorbidity factors have a greater impact on the explanation of health expenditure than demographic variables.

### Dealing with missing data in practice: methods, applications and implications for HIV cohort studies

Belén Alejos Ferreras
Missing data are common in HIV cohort studies, affecting both the covariates and the outcome. In this case study, we compare different methods to deal with missing data applied to estimate mortality by Hepatitis C virus coinfection in the cohort of the Spanish Network of HIV Research (CoRIS) using Stata. Poisson regression was used to estimate Mortality Rate Ratios, using 5 methods to handle missing data in both the covariates and the cause of death: complete-case, indicator method (IM), multiple imputation by chained equations (MICE), multiple imputation then deletion (MID) and inverse probability weighting (IPW). Strong predictors were found for the values of incomplete variables and for their probability of being missing. No significant differences were found in excess hazard ratios between the different methods. However, the complete-case approach led to less precise estimates, and incorrect classification of cause of death or deletion of cases with missing cause of death when using complete-case, MID or IPW led to underestimation of the excess mortality rates. In this case study, MICE seemed to work best because it both corrected bias and produced the most accurate estimates. Although MICE rests on the untestable assumption that data are missing at random, that assumption seemed plausible in this context.

### Statistical analysis of the evolution of autonomic symptoms in patients with Parkinson's disease

Abelardo Fernández Ch.
Hospital Ramón y Cajal
Introduction: Autonomic symptoms (AS) of Parkinson's disease (PD) may be present from the time of diagnosis and even precede it. So far there are many unknowns in the relationship between the evolution of AS and other variables of the disease. Objectives: To describe the evolution of AS and its relationship with motor and non-motor symptoms in PD. Methods: Observational, multicenter study (Spain and Holland) with longitudinal follow-up, with evaluations at baseline and in the fourth year. The SCOPA-Motor, SCOPA-Cognition and HY stage scales were used, together with the self-administered SCOPA-AUT and SCOPA-Sleep questionnaires. Statistical magnitudes (effect size and relative change) and sensitivity to change (standard error of the measure, 10% of the total maximum score and ½ standard deviation) were calculated for each subscale of the SCOPA-AUT, along with their mean value (estimated value of change, EVC). Patients were classified according to their worsening in the autonomic subscales, depending on whether the difference between baseline and follow-up score exceeded or not.

### Construction and validation of a predictive model for the identification of complex chronic patients

Silvia Badal, Alexander Zlotnik, Ruth Usó, David Vivas
The objective of this study was the construction and validation of a predictive model for the identification of complex chronic patients. A cross-sectional study was performed on the population of the Comunidad Valenciana region in 2015 (4,708,754 persons). Dependent variable: resource use variables greater than or equal to the 95th percentile (P95), including the number of primary contacts, number of hospital admissions, number of visits to emergency departments, pharmaceutical problems and pharmaceutical costs. Predictive variables: age, morbidity (according to clinical risk groups (CRG)) and variables corresponding to the resource use mentioned above. Persons exceeding P95 were 0.2% of the population; the study was therefore carried out on a 10% sample stratified by CRG, and all persons without chronic or moderate conditions (those belonging to health states 1, 2, 3 and 4) were excluded, leaving a total of 150,252 persons. Then, a logistic regression model was built. Its validity was analyzed with sensitivity, specificity, a goodness-of-fit test and the area under the ROC curve (AUC).

### Methodology for using Weka machine learning algorithms from Stata

Alexander Zlotnik
Stata does not include most classical machine learning algorithms in its core libraries. A few algorithms are available through plug-ins, such as wrappers for the LIBSVM library; however, these sometimes exhibit performance problems, do not expose the full functionality of the algorithm and are often challenging to modify. Weka is an open source software suite written in Java which implements most well-known machine learning algorithms. Since its source code is available and documented, it is relatively easy to introduce custom modifications that should fulfill most practical business and research needs. In this talk, we present a simple method for integrating Stata ADO-file programs with standard Weka algorithms (CART and C4.5 decision trees, support vector machines, neural networks, Bayesian networks, KNNs, LogitBoost classifiers, stacking classifiers, generic ensemble classifiers, et cetera) as well as custom Weka algorithms (such as CART trees with LogitBoost on its branches).

### A proposal for a new Stata licensing scheme based on blockchain, cloud computing and grid computing

Alexander Zlotnik
David Arroyo Manzano
Stata is a well-known statistical software package used for a wide variety of statistical analyses. As Stata users in several fields know, current and future data processing often requires increasingly large computing resources. Stata/MP is a multiprocessor version suitable for some of these tasks but, even on powerful hardware, its capacities are sometimes surpassed by computationally demanding tasks. If these tasks can be parallelized, distributed computing approaches can also be used. Some software packages that require powerful computing resources, such as ChessBase engines used for deep chess variant analysis, have introduced the possibility of offloading some calculations to their private clouds. Alternatively, large computational problems such as the SETI@home project have chosen a grid computing model. The latter approach could be further enhanced with a blockchain-based distributed ledger that registers the computational power contributed to the community by each of its members and rewards them for their contribution. All of these approaches, or combinations thereof, could be used for new Stata licensing schemes.

### `sdmxuse`. Module to import data from statistical agencies using the SDMX standard

Sébastien Fontenay
Université catholique de Louvain
SDMX, which stands for Statistical Data and Metadata eXchange, is an ISO standard developed by seven international organizations (BIS, ECB, Eurostat, IMF, OECD, the United Nations and the World Bank) to facilitate the exchange of statistical data. The package sdmxuse (available from the SSC archive) allows Stata users to download and import SDMX data directly within their favorite software. The program builds and sends a query to the statistical agency (using RESTful web services), then imports and formats the downloaded dataset (in XML format). The complex structure of the datasets (so-called “cube”) is reviewed to show how users can send specific queries and import only the required time series. sdmxuse might prove useful for researchers who need frequently updated time series and wish to automate the downloading and formatting process.

### Using Stata to estimate dynamic binary random effects models with unbalanced panels

Pedro Albarrán, Raquel Castro and Jesús M. Carro
The purpose of this paper is to present the Stata implementation of the estimators proposed by Albarrán et al. (2017) for dynamic binary choice correlated random effects (CRE) models with unbalanced panel data. The procedure allows for unrestricted correlation between the sample selection process that determines the unbalancedness and the time-invariant unobserved heterogeneity. We create a specific command for this procedure, named xtunbalmd. It estimates the model for each subpanel separately and obtains estimates of the parameters common across subpanels by minimum distance (MD). This estimation method is faster than estimation by Maximum Likelihood (ML) because it allows using the same estimation routines that we would use if we had a balanced panel, while keeping the good asymptotic properties of the ML estimator for the whole sample.

### Complementarity Analysis in Multinomial Models: The gentzkow command

Ricardo Mora and Yunrong Li
Universidad Carlos III and Southwestern University of Finance and Economics
In the presence of a choice of two binary variables, the usual econometric procedure within the framework of the random utility model is the estimation of a bivariate probit that takes into account the potential correlation between the error terms for the utilities of all different alternatives. This approach is not useful if the objective of the analysis is the study of the complementarity or substitutability of the two alternatives, since the bivariate model assumes by construction that the two alternatives are independent from the economic point of view. In other words, in the bivariate probit, a factor that makes one alternative more attractive in the first choice but does not directly affect the utilities of the other choice does not induce a change in the second choice. To study the complementarity or substitutability of alternatives, it is necessary to estimate a more flexible model, such as the multinomial model, and to compute expected complementarity patterns from the standard results. In this presentation, we show the Stata module gentzkow, which performs the complete analysis, and we show its usefulness with an example using data from China on the double choice of grandparents: first, to live in the same house as their children and grandchildren, and second, to help with the care of the grandchild.

### Performing Probabilistic Cost-Effectiveness Analysis via decision tree modelling in Stata: The manantial command

Manuel García-Goñi and Ricardo Mora
Decision models are based on Markov processes that describe the statistical laws for possible states or sequential events to which an individual or patient is subject within a system. Every decision model can be represented as a probability tree containing nodes and branches. Each node represents a possible state of the patient in their clinical evolution and socio-economic status, while each branch joins two states that are sequentially possible. Thus, from the initial node representing the patient's entry into the system arise several branches representing the different possible subsequent states. Each terminal node of the tree represents the last possible state after a particular patient evolution. Therefore, probability trees are diagrams that represent all possible evolutions of a patient within a system. By assigning net costs to each node and conditional probabilities to each branch, it is possible to calculate the expected net cost per patient. Using Monte Carlo techniques, the distribution of estimated net costs per patient in the population of interest can be estimated to incorporate the uncertainty inherent in the use of estimated values for conditional probabilities and net costs. The Stata command takes as inputs the decision tree, probability distributions and payoffs. The command provides significance tests and confidence intervals and performs sensitivity analysis. We illustrate the use of the command with an evaluation of early intervention in psychosis. Early intervention in psychosis is a clinical approach for those who experience symptoms of psychosis for the first time. It is part of a new paradigm of prevention in psychiatry that is shaping the reform of mental health services. The focus is on the early detection and treatment of early symptoms of psychosis during the formative years critical to the psychotic condition.
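
The calculation described in this abstract can be sketched in a few lines. The following is a minimal illustration in Python, not the manantial command itself: the tree shape, the Beta and Gamma distributions, and all parameter values are hypothetical assumptions chosen only to show how an expected net cost is rolled back through a probability tree and how Monte Carlo draws turn it into a distribution.

```python
import random
import statistics


def expected_cost(node):
    """Expected net cost of a probability tree.

    A node is a dict with a "cost" and an optional list of "branches",
    each branch being a (conditional probability, child node) pair.
    """
    total = node["cost"]
    for prob, child in node.get("branches", []):
        total += prob * expected_cost(child)
    return total


def draw_tree(rng):
    """Draw one hypothetical tree with uncertain inputs (all values assumed)."""
    p_respond = rng.betavariate(60, 40)        # probability of a good outcome
    cost_treat = rng.gammavariate(100, 10)     # treatment cost, mean 1000
    cost_relapse = rng.gammavariate(25, 200)   # relapse cost, mean 5000
    return {
        "cost": cost_treat,
        "branches": [
            (p_respond, {"cost": 0.0}),
            (1 - p_respond, {"cost": cost_relapse}),
        ],
    }


def simulate(n_draws=5000, seed=12345):
    """Monte Carlo distribution of the expected net cost per patient."""
    rng = random.Random(seed)
    draws = sorted(expected_cost(draw_tree(rng)) for _ in range(n_draws))
    mean = statistics.fmean(draws)
    lower = draws[int(0.025 * n_draws)]        # 2.5th percentile
    upper = draws[int(0.975 * n_draws) - 1]    # 97.5th percentile
    return mean, lower, upper


if __name__ == "__main__":
    mean, lower, upper = simulate()
    print(f"mean={mean:.0f}, 95% interval=({lower:.0f}, {upper:.0f})")
```

Each draw replaces the point estimates of probabilities and costs with a random value from its assumed distribution, so the spread of the results reflects input uncertainty rather than patient-level variation.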

### Random samples generation with Stata from continuous and discrete distributions

Gabriel Aguilera-Venegas, José Luis Galán-García, M. Ángeles Galán-García, Pedro Rodríguez-Cielos, Ricardo Rodríguez-Cielos
Simulations are nowadays a very important way of analyzing new improvements in different areas before their physical implementation, which may require substantial resources that can only be committed when the probability of success is high. The use of random samples from different distributions is a must in simulations. In this talk, we introduce new Stata functions for generating random samples from continuous and discrete distributions that are not covered by the built-in Stata random-number generation functions. In addition, we will also introduce new Stata functions for generating random samples as an alternative to the built-in Stata functions. The quality of the generated samples will be checked using the mean squared error (MSE) of the differences between the sample frequencies and the theoretically expected ones. We will also provide bar charts that allow the user to compare the sample graphically with the exact distribution function of the distribution being sampled.
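
As a rough illustration of the quality check mentioned above, the sketch below samples from a discrete distribution by inverse-transform sampling and then computes the MSE between observed relative frequencies and theoretical probabilities. It is written in Python rather than Stata, and the sampling method, the function names and the example distribution are assumptions made for illustration; the abstract does not say how the authors' functions are implemented.

```python
import random
from bisect import bisect_right
from collections import Counter
from itertools import accumulate


def sample_discrete(values, probs, n, rng):
    """Inverse-transform sampling from a discrete distribution."""
    cumulative = list(accumulate(probs))
    sample = []
    for _ in range(n):
        u = rng.random()
        # First index whose cumulative probability exceeds u; the min()
        # guards against rounding when the probabilities sum to just under 1.
        idx = min(bisect_right(cumulative, u), len(values) - 1)
        sample.append(values[idx])
    return sample


def frequency_mse(sample, values, probs):
    """Mean squared difference between sample frequencies and probabilities."""
    n = len(sample)
    counts = Counter(sample)
    squared = [(counts[v] / n - p) ** 2 for v, p in zip(values, probs)]
    return sum(squared) / len(values)


if __name__ == "__main__":
    values = [0, 1, 2]             # hypothetical support
    probs = [0.2, 0.5, 0.3]        # hypothetical probabilities
    rng = random.Random(2017)
    draws = sample_discrete(values, probs, 20000, rng)
    print(frequency_mse(draws, values, probs))
```

An MSE close to zero indicates that the relative frequencies track the target probabilities; the same comparison is what a bar chart of observed versus expected frequencies shows visually.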

### Multilevel Models for Cross-Sectional and Longitudinal Data

Antonio M. Jaime-Castillo
Typical multilevel analysis in comparative research implies the use of cross-sectional data for multiple countries. Multilevel models in such settings are likely to be affected by problems of endogeneity and omitted-variable bias because of unobserved heterogeneity. However, there is a growing volume of longitudinal data in comparative data projects, as they typically span multiple waves (e.g., the European Social Survey or the World Values Survey). This makes it possible to exploit the longitudinal dimension of the data by splitting the effect of aggregate variables into two different sources of variation (between and within countries), which makes multilevel models robust against the problem of unobserved heterogeneity. Drawing upon a few recent works in the literature that propose including both cross-sectional and longitudinal effects in multilevel models, in this presentation I focus on the theoretical and practical implications of this modeling strategy. Furthermore, I provide some examples and practical recommendations for using this approach with Stata.
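
The between/within split of an aggregate variable can be sketched as follows. This is a minimal Python illustration under assumed names: the function, the variable `unemployment` and the toy country-wave panel are hypothetical and do not come from the talk. The country mean carries the between-country variation, and the deviation from that mean carries the within-country variation over waves.

```python
from collections import defaultdict


def split_between_within(rows, group, var):
    """Split `var` into a group mean (between) and a deviation from it (within)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row[group]] += row[var]
        counts[row[group]] += 1

    result = []
    for row in rows:
        mean = totals[row[group]] / counts[row[group]]
        new_row = dict(row)                          # leave the input untouched
        new_row[var + "_between"] = mean             # varies only across groups
        new_row[var + "_within"] = row[var] - mean   # varies only over time
        result.append(new_row)
    return result


if __name__ == "__main__":
    # Hypothetical country-wave panel
    panel = [
        {"country": "ES", "wave": 1, "unemployment": 10.0},
        {"country": "ES", "wave": 2, "unemployment": 20.0},
        {"country": "SE", "wave": 1, "unemployment": 6.0},
        {"country": "SE", "wave": 2, "unemployment": 8.0},
    ]
    for row in split_between_within(panel, "country", "unemployment"):
        print(row["country"], row["wave"],
              row["unemployment_between"], row["unemployment_within"])
```

Entering both components as separate regressors lets a multilevel model estimate a cross-sectional and a longitudinal effect for the same aggregate variable.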

### The effect of birth weight on cognitive performance: Is there a social gradient? Is there compensation?

UNED
Demography has traditionally been interested in birth weight and the impact that certain ascriptive characteristics have on birth weight. Most of the interest in this variable lies in the fact that weight at birth is a significant predictor of infant health outcomes (as well as health at adult ages), but also of cognitive performance and educational results. While evidence explaining the prevalence of low birth weight (LBW) is common across disciplines, it is much less frequent to see high-quality evidence built from large samples quantifying the impact of weight at birth on schooling outcomes. In this paper we use data from the Chinese Family Panel Study (2010 wave), a large-scale representative sample of Chinese households, to model the effect of low birth weight on standardized test scores among Chinese children aged 10-15 years. Our evidence confirms a highly significant negative effect of LBW on the results obtained by children in both mathematics and Chinese language. The paper also shows a clear gradient in the prevalence of low birth weight by family background. Our evidence also implies that highly educated parents (mothers) can actually compensate for the disadvantage that low birth weight represents in terms of cognitive performance.

### Education in Spain: tell me how you look at data and I will tell you what you will see

Pau Miret Gamundi
Universidad Autónoma de Barcelona, Centro de Estudios Demográficos
According to Eurostat, early school leaving in Spain fell from 26.3% to 19.0% between 2011 and 2016. At this pace, the 2020 target (15%) will certainly be reached soon. According to the same database, the percentage of the population aged 30-34 with tertiary education has remained above 40%. With these indicators, we can only congratulate a society that is winning the battle against premature school leaving and has such an abundant volume of highly qualified young people. The above information is based on the Spanish Labour Force Survey (EPA, Encuesta de Población Activa), a panel survey in which an individual can be observed up to six times in a row. Applying Stata's panel-data commands (xt) changes the picture of the level of education in Spain dramatically: early school leaving is much higher and the proportion of people who have completed a university degree is much lower. We just need to take into account that the EPA surveys the same individuals on different occasions.

### `svy` or `calibrate`? Survey post-adjustments in Stata

Pablo Cabrera and Modesto Escobar