School on Data Science and Causal Modelling for Big Data in Health and Medical Research

Cass Business School | 4 days (14th October 2019 - 17th October 2019) | Stata | Advanced, Intermediate
Big Data, Causal Analysis, Medical Statistics


Health and medical big data now offer researchers and physicians an expanding opportunity to exploit new sources of information on health-related phenomena. This is possible thanks to an unprecedented amount of information collected for clinical and administrative purposes at the hospital, individual, and even sub-individual (e.g. gene-based) level of detail.

Today, health information technology (HIT) based on greater computing power and storage, along with recent developments in data science and visualization, statistical algorithms, and causal modelling, makes the analysis of health and medical big data (at any level) easier to carry out than it was just a few years ago. This opens up new opportunities for predicting patients’ responses to selected exposures, for detecting patterns and regularities underlying administrative data, for monitoring and assessing the costs and benefits of health-related programs and treatments, and finally for discovering etiological/causal patterns brought about by specific treatments.

All this has also been eased by the development of new statistical methods able to “learn” from massive sets of data (machine learning techniques), as well as by the extraordinary advances in causal inference and causal modelling that took place in the last decade. In health and medical contexts, these approaches are now indispensable prerequisites for accurate data mining, for improving predictive power and factor-importance detection, for the cost-benefit analysis of health programs, and for causal path analysis.

Machine learning (ML) techniques can be useful in many clinical contexts. Some examples are:

  • Increasing the accuracy of predicted transplant rejection probabilities, given the idiosyncratic and environmental characteristics of patients recorded in organ transplantation waiting lists;
  • Detecting the relative importance of the factors driving successful cancer-prevention screening;
  • Identifying the most influential gene sequences, i.e. those likelier to be associated with specific diseases;
  • Tracing dose-response patterns for predicting patients’ resilience to different treatment types over different treatment intensities.

For administrative purposes, these techniques make it possible for policymakers at the institutional level, or managers at the hospital and clinic level, to better explore the relationship between the costs and benefits of new health programs or of changes in hospital management practices. Indeed, modern causal inference techniques can estimate counterfactual net benefit functions that gauge what the cost/benefit balance would have been with and without the implemented changes.

The objective of this school is therefore to train participants in data science and in causal modelling and inference for health and medical big data. The main subjects that the course will cover include: (i) identification of the main sources of administrative and clinical data; (ii) management and manipulation of these data using Stata; (iii) correct application of machine learning and data mining techniques; (iv) accurate application of causal inference methods, via both counterfactual and structural modelling; (v) ex-post cost-benefit analysis for health policy programs.

In line with the general philosophy of our training courses, the lessons will be highly interactive and mostly applied in content. They will include numerous empirical applications to health and medical data. Participants will be able to experiment with the techniques learned through exercises performed at their own workstations under the guidance of the instructor.


Day 1

Session 1: Management and exploration of administrative and clinical health care data

  • Data for health statistics: definitions, types, structures
  • Administrative health care data for pharmacoepidemiology and clinical medicine
  • Electronic medical records and health care claims data
  • Data management: organization, cleaning, manipulation, and descriptive analysis
  • Examples and practice using Stata

Session 2: Practicing traditional inference methods

  • Sampling theory: a brief review
  • Testing hypotheses for randomized and nonrandomized medical trials and programs
  • Exposure/response models for continuous, binary, count, and fractional responses
  • Survival-time data and modelling using patient/disease registries
  • Examples on real health data and practice using Stata

Day 2

Session 1: An introduction to machine learning

  • Supervised vs. unsupervised learning
  • Regression vs. classification problems
  • Inference vs. prediction
  • Sampling vs. specification error
  • Parametric vs. non-parametric learning
  • Measuring the quality of fit: in-sample vs. out-of-sample prediction power
  • The bias-variance trade-off
  • Model specification and validation: bootstrap and cross-validation
  • An overview of machine learning methods

Session 2: Machine learning applications using Stata

Addressing classification problems

  • Bayes classifier
  • Logistic regression
  • Discriminant Analysis
  • K-nearest neighbours

Addressing regression problems

  • Optimal subset selection
  • Lasso, Ridge, and Elastic-net regression
  • Non-parametric regression

Day 3

Session 1: Causal inference for treatment effects

  • Observational data and counterfactual causality
  • Rubin’s potential outcome model
  • Assumptions for treatment effects identification
  • Causal inference estimation methods
  • Survival-time treatment effect modelling
  • Machine learning and causal inference

Session 2: Treatment effect applications with Stata

  • Syntax and application of the Stata command teffects
  • Syntax and application of the Stata command stteffects
  • Ex-post cost-benefit analysis of health programs: an application

Day 4

Session 1: Structural causal modelling

  • Structural equation modelling (SEM) in health and medical applications
  • Model specification, estimation, and validation
  • Modelling moderation and mediation
  • Direct, indirect, and total effects
  • Results interpretation

Session 2: Applications to health care and medical data using Stata

  • The language of SEM
  • The syntax of the Stata command sem
  • The SEM builder to create models manually
  • Applications on real and artificial datasets

Pre-course Reading

Machine learning and data mining

  • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013), An Introduction to Statistical Learning with Applications in R, Springer, New York.
  • Hastie, T., Tibshirani, R., and Friedman, J. (2008), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition, Springer.

Causal modelling

  • Acock, A.C. (2013), Discovering Structural Equation Modeling Using Stata, Revised Edition, Stata Press.
  • Cerulli, G. (2015), Econometric Evaluation of Socio-Economic Programs: Theory and Applications, Springer.

Post-course Suggested Reading

  • Cameron, A.C. and Trivedi, P.K. (2009), Microeconometrics Using Stata, Stata Press. Chapter 4.

DAILY TIMETABLE (subject to minor changes)

09:00 - 09:20 Registration

09:30 - 11:00 Session 1a

11:00 - 11:15 Tea / coffee break

11:15 - 12:45 Session 1b

12:45 - 14:00 Lunch

14:00 - 15:15 Session 2a

15:15 - 15:30 Tea / coffee break (Feedback Session)

15:30 - 17:00 Session 2b


Prerequisites

  • Basic knowledge of descriptive and inferential statistics.
  • Basic knowledge of the Stata software.

Terms and Conditions

  • All costs exclude local taxes, where applicable.
  • Student registrations: Attendees must provide proof of full-time student status at the time of booking to qualify for the student registration rate (valid student ID card or authorised letter of enrollment).
  • Additional discounts are available for multiple registrations.
  • Cost includes course materials, lunch and refreshments.
  • If you need assistance in locating hotel accommodation in the region, please notify us at the time of booking.
  • Payment of course fees is required prior to the course start date.
  • Registration closes 5 calendar days prior to the start of the course.
    • 100% fee returned for cancellations made more than 28 calendar days prior to the start of the course.
    • 50% fee returned for cancellations made between 14 and 28 calendar days prior to the start of the course.
    • No fee returned for cancellations made less than 14 calendar days prior to the start of the course.

The number of delegates is restricted. Please register early to guarantee your place.

  • Registration rates: Commercial, Academic, Student
    1-day pass (14/10/2019 - 17/10/2019)
    2-day pass (14/10/2019 - 17/10/2019)
    3-day pass (14/10/2019 - 17/10/2019)
    4-day pass (14/10/2019 - 17/10/2019)

All prices exclude VAT or local taxes where applicable.

Timberlake Consultants