Recent years have witnessed an unprecedented availability of information on social, economic, and health-related phenomena. Researchers, practitioners, and policymakers now have access to huge datasets (so-called “Big Data”) on people, companies and institutions, web and mobile devices, satellites, and more, at increasing speed and detail.
Machine learning is a relatively new approach to data analytics that sits at the intersection of statistics, computer science, and artificial intelligence. Its primary objective is to turn information into knowledge and value by “letting the data speak”. To this end, machine learning limits prior assumptions on data structure and relies on a model-free philosophy that favours algorithm development, computational procedures, and graphical inspection over tight assumptions, algebraic development, and analytical solutions. Computationally unfeasible only a few years ago, machine learning is a product of the computer era: of today's computing power and machines' ability to learn, of hardware development, and of continuous software upgrading.
This course is a primer on machine learning techniques using Stata. Stata now offers various packages for machine learning, which are, however, little known to many Stata users. This course fills this gap by making participants familiar with Stata's potential to draw knowledge and value from large, and possibly noisy, datasets. The teaching approach will rely on graphical language and intuition more than on algebra. The training will use instructional as well as real-world examples, and will balance theory and practical sessions evenly.
After the course, participants are expected to have an improved understanding of Stata's potential to perform machine learning, and to be able to master research tasks including, among others: (i) factor-importance detection, (ii) signal-from-noise extraction, (iii) correct model specification, (iv) model-free classification, both from a data-mining and a causal perspective.
Session 1: The Basics of Machine Learning
Machine Learning: definition, rationale, usefulness
- Supervised vs. unsupervised learning
- Regression vs. classification problems
- Inference vs. prediction
- Sampling vs. specification error
Coping with the fundamental non-identifiability of E(y|x)
- Parametric vs. non-parametric models
- The trade-off between prediction accuracy and model interpretability
- Measuring the quality of fit: in-sample vs. out-of-sample prediction power
- Goodness-of-fit indices
- The bias-variance trade-off and the Mean Square Error (MSE) minimization
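As a pointer to what the bias-variance item covers, the standard decomposition of the expected out-of-sample squared error can be written as below. The setup is an assumption made here for illustration, not taken from the course material: the outcome is y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and the new observation (x_0, y_0) is independent of the training data used to build the fitted function.

```latex
% Assumed setup: y = f(x) + \varepsilon, E[\varepsilon] = 0, Var(\varepsilon) = \sigma^2,
% and (x_0, y_0) is a new observation independent of the training sample.
\begin{align*}
\mathrm{E}\big[(y_0 - \hat f(x_0))^2\big]
  &= \mathrm{Var}\big(\hat f(x_0)\big)
   + \big[\mathrm{Bias}\big(\hat f(x_0)\big)\big]^2
   + \sigma^2, \\
\mathrm{Bias}\big(\hat f(x_0)\big)
  &= \mathrm{E}\big[\hat f(x_0)\big] - f(x_0).
\end{align*}
```

More flexible methods typically lower the bias term and raise the variance term; the irreducible error σ² is a floor that no method can go below.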
Session 2: Simulation, Resampling, and Validation Methods
Monte Carlo simulations
- Logic and functioning of a Monte Carlo experiment
- Implementing Monte Carlo experiments via simulate and postfile
- The logic of the Bootstrap
- Bootstrapping standard errors via bootstrap
- The validation set approach
- Leave-One-Out Cross-Validation
- K-fold cross-validation
- The Stata package crossfold
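The items above can be sketched in a short do-file. This is a minimal illustration under stated assumptions, not the course's own material: the program name olssim, the data-generating process (true slope of 2), the number of replications, and the use of Stata's bundled auto dataset are all choices made here for the example. The cross-validation loop is written by hand to show the logic; the user-written crossfold package named in the outline is designed to automate this kind of loop.

```stata
* --- Monte Carlo: sampling distribution of an OLS slope -------------------
clear all
set seed 12345

* Hypothetical program: draws one sample and returns the estimated slope
program define olssim, rclass
    drop _all
    set obs 100
    generate x = rnormal()
    generate y = 1 + 2*x + rnormal()    // assumed DGP, true slope = 2
    regress y x
    return scalar b = _b[x]
end

simulate b = r(b), reps(500) nodots: olssim
summarize b                             // mean should be close to 2

* --- Bootstrap: standard errors of regression coefficients ----------------
sysuse auto, clear
bootstrap _b, reps(200) seed(12345): regress price mpg weight

* --- 5-fold cross-validation written out by hand --------------------------
sysuse auto, clear
set seed 12345
generate double u = runiform()
sort u
generate fold = mod(_n, 5) + 1          // random assignment to folds 1..5
generate double sqerr = .

forvalues k = 1/5 {
    quietly regress price mpg weight if fold != `k'   // fit on the other folds
    quietly predict double yhat, xb                   // predict for all rows
    quietly replace sqerr = (price - yhat)^2 if fold == `k'
    drop yhat
}

quietly summarize sqerr
display "5-fold CV RMSE = " sqrt(r(mean))
```

Setting the number of folds equal to the number of observations in the same loop gives leave-one-out cross-validation; using a single random split gives the validation set approach.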
Session 3: Non-parametric Regression - Local Methods
- Beyond parametric models: the “why” and the “how”
- Types of non-parametric regression: local vs. global approaches
- Nearest-neighbor regression
- Kernel-based regression
- The Stata npregress command
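A kernel-based local regression can be sketched with the built-in npregress command, which is available from Stata 15 onwards. The dataset and variables below (the bundled auto data, price on mpg) are assumptions chosen only for illustration.

```stata
* Kernel regression of price on mpg using Stata's bundled example data.
* By default npregress kernel fits a local-linear estimator and chooses
* the bandwidth by cross-validation.
sysuse auto, clear
npregress kernel price mpg

* Plot the estimated conditional mean over the scatter (one covariate only)
npgraph
```

With a single covariate the plotted curve makes the local nature of the method visible: each fitted value is driven mainly by observations whose mpg lies close to the evaluation point.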
Session 4: Non-parametric Regression - Global Methods
- Polynomial and series regression with bfit
- Spline regression with mkspline
- Generalized additive models with gam
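As an illustration of a global method, the sketch below fits a restricted cubic spline with the built-in mkspline command. The dataset, the variable names, the stub wsp, and the choice of four knots are assumptions made here for the example.

```stata
* Restricted cubic spline of price on weight, bundled auto data
sysuse auto, clear

* Four knots at default percentiles; creates wsp1, wsp2, wsp3
mkspline wsp = weight, cubic nknots(4)

* The spline basis enters an ordinary linear regression
regress price wsp*

* Fitted curve against the raw data
predict double price_hat, xb
twoway (scatter price weight) (line price_hat weight, sort)
```

Unlike the local methods of Session 3, the curve here comes from a single regression on the whole sample, so the flexibility is controlled by the number of knots rather than by a bandwidth.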
Session 5: Model Selection and Regularization
- Optimal subset selection with combinatorics
- Lasso, Ridge, and Elastic Net regression with lassopack
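A minimal sketch of regularized regression with the user-written lassopack suite follows. It assumes lassopack has been installed from SSC; the dataset, the candidate regressors, the seed, and the number of folds are illustrative choices, not part of the course material.

```stata
* Requires the user-written lassopack suite:
*   ssc install lassopack
sysuse auto, clear
local xvars mpg weight length headroom trunk turn displacement gear_ratio

* Lasso path, with the penalty chosen by the extended BIC
lasso2 price `xvars', lic(ebic)

* Lasso with the penalty chosen by 5-fold cross-validation;
* lopt re-estimates the model at the lambda minimizing the CV error
cvlasso price `xvars', lopt seed(12345) nfolds(5)

* Same idea for the elastic net (alpha = 0.5) and ridge (alpha = 0)
cvlasso price `xvars', alpha(0.5) lopt seed(12345) nfolds(5)
cvlasso price `xvars', alpha(0) lopt seed(12345) nfolds(5)
```

The alpha() option moves between the two penalties: 1 (the default) is the lasso, 0 is ridge, and values in between give the elastic net.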
Session 6: Tree-based Regression
- Model uncertainty and credibility
- The LOCO sensitivity algorithm via sensimatch
- An introduction to Regression Trees
- Bagging, Random Forests, and Boosting
- The R-based Stata command stree
Pre-course Reading List
- Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani (2013), An Introduction to Statistical Learning with Applications in R, Springer, New York. ISBN 978-1-4614-7137-0.
Post-course Reading List
- Trevor Hastie, Robert Tibshirani, and Jerome Friedman (2008), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, second edition, Springer
Prerequisites
- Some knowledge of basic statistics and econometrics is required: the notion of conditional expectation and its properties; point and interval estimation; the regression model and its properties; probit and logit regression.
- Basic knowledge of the Stata software
Terms and Conditions
- Student registrations: Attendees must provide proof of full-time student status at the time of booking to qualify for the student registration rate (valid student ID card or authorised letter of enrolment).
- Additional discounts are available for multiple registrations.
- Cost includes course materials, lunch and refreshments.
- Delegates are provided with temporary licences for the software used in the course and will be instructed to download and install it prior to the start of the course. Alternatively, laptops can be hired for a fee of £10.00 (ex. VAT) per day.
- If you need assistance in locating hotel accommodation in the region, please notify us at the time of booking.
- Payment of course fees is required prior to the course start date.
- Registration closes 5 calendar days prior to the start of the course.
- 100% of the fee is returned for cancellations made more than 28 calendar days prior to the start of the course.
- 50% of the fee is returned for cancellations made between 14 and 28 calendar days prior to the start of the course.
- No fee is returned for cancellations made less than 14 calendar days prior to the start of the course.
The number of delegates is restricted. Please register early to guarantee your place.