
Statistics is the New Machine Learning

Machine Learning (ML) is a rapidly developing field. There is enormous competition to be the inventor of a new method that gets widely adopted, and investors stand to make a fortune from a startup that implements a successful new analytical tool. I find it fascinating to watch the emerging hot topics. Recently, I noticed a trend. Hotshot VCs, take note! (But don't get too excited yet.)

First, let's look at some of the latest must-have techniques.

Time series

Me: Hello, old friend. What are you doing here?

Time series: Well, it turned out that not everything in life is a flat file of independent observations, so the ML people called me in.

Me: I thought you'd retired.

Time series: Ha ha ha! No.

H2O.ai, arguably the market leader in automated machine learning (autoML), have this to say right on the front page of their website:

Award-winning Automatic Machine Learning (AutoML) technology to solve the most challenging problems, including Time Series and Natural Language Processing.

Clearly, they think time series is one of the big selling points, or they wouldn't have put it there*. It even bumped the mighty Deep Learning off the Top Two. And perhaps the reason is that those highly flexible and non-linear modelling tools, like XGBoost, take no account of dependency in the data. Search online for, say, "XGBoost for market tick data", and you'll find plenty of people who still believe you can push any data into a sufficiently flexible algorithm and get a useful model out. Elsewhere, you can find confusion between analysis of streaming data, where sliding windows of time define batches of data, and true time series models. I would argue that, to model data that are grounded in time and/or space, you should also leverage domain knowledge: that's a phrase we'll be hearing again soon.
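
As a small illustration of the dependency point (a sketch with made-up data, using scikit-learn's gradient boosting as a convenient stand-in for XGBoost, and nothing to do with any vendor's product): evaluating a flexible learner on autocorrelated data with a shuffled cross-validation split lets the future leak into the training folds, whereas a time-ordered split only ever trains on the past.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # stand-in for XGBoost
from sklearn.model_selection import KFold, TimeSeriesSplit, cross_val_score

# Hypothetical autocorrelated series: an AR(1) process, with lagged values as features.
rng = np.random.default_rng(42)
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.9 * y[t - 1] + rng.normal(scale=0.5)
X = np.column_stack([np.roll(y, k) for k in (1, 2, 3)])[3:]  # lags 1 to 3
target = y[3:]

model = GradientBoostingRegressor(random_state=0)

# Shuffled K-fold ignores time order, so test points sit in between training points...
shuffled = cross_val_score(model, X, target,
                           cv=KFold(n_splits=5, shuffle=True, random_state=0))
# ...whereas TimeSeriesSplit always trains on earlier data and tests on later data.
ordered = cross_val_score(model, X, target, cv=TimeSeriesSplit(n_splits=5))

print(f"shuffled CV R^2:     {shuffled.mean():.3f}")
print(f"time-ordered CV R^2: {ordered.mean():.3f}")
```

The gap between the two scores gives a crude sense of how much an evaluation that ignores time order can flatter the model.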

Vendors working on the data storage and provisioning side have also been quick to offer a range of time-series-aware database products that scale to Big Data. Just search for "scalable time series database" and you'll find plenty. Five years ago it was a concept that you could shoehorn into a more generic type of database, but not a product in itself that could attract investment and get ML pulses racing.

It's hard not to chuckle as a statistician, seeing the ML folk discover autocorrelation 100 years on, but more constructively, this can help us predict where the next ML boom might come from. Dependency does not arise only from time-located data; it also arises from spatial data, networks (social or otherwise), heterogeneous data sources, and hierarchical structures. There is a lot of investment in geographical information systems and networks in the machine learning, AI and Big Data world, so it would be reasonable to expect this to appear in the near future.

* That's assuming they didn't tailor the front page for maximum appeal, having already categorised me under Statistics (I jest).

Uncertainty

For a while, ML algorithms became enormously popular on the strength of their predictions (point estimates). Then, unease set in about the lack of uncertainty in the outputs. Suppose you have to produce a categorical prediction. The algorithm returns the category that scores best on its objective function, and that's the only information you have. Some decisions truly are binary (buy gold, sell zinc), while others are fuzzier, but even the binary ones can often be hedged.

In their book Computer Age Statistical Inference, Brad Efron and Trevor Hastie strike a conciliatory tone between statistics and ML, suggesting that most algorithms for data analysis start out with point estimates, and later, uncertainty is added. The EM algorithm is an example.

It was perhaps the rise of self-driving cars that made the absence of uncertainty too uncomfortable. Decisions do not always have to be taken by the AI; they can be referred to the human driver (let's hope they haven't fallen asleep), but that requires a measure of uncertainty. This problem remains imperfectly solved, despite massive investment, and is likely to provide interesting job opportunities for some time to come.

Now, we can see efforts to add measures of uncertainty into ML parameter estimates and predictions. Often, these look a lot like 95% confidence or credible intervals, and there is a big advantage to adding in domain knowledge. The leading software library TensorFlow Probability has this to say:

"The TensorFlow team built TFP for data scientists, statisticians, and ML researchers and practitioners who want to encode domain knowledge to understand data and make predictions."

Without statistical approaches (that is, mathematical models of the uncertainty, based on probability), adding uncertainty to ML means laboriously re-fitting models many times or adopting some uncomfortable approximation. Any alternative technique that provides a quantified uncertainty output alongside a point estimate, quickly and with well-established long-run properties, is going to be popular with bosses everywhere. Sounds like statistics?
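
To make the "re-fitting many times" point concrete, here is a minimal sketch (made-up data, scikit-learn, a deliberately crude recipe) of the brute-force bootstrap route to an approximate interval around an ML prediction; a parametric statistical model would deliver an interval from a single fit.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: a noisy non-linear relationship.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

x_new = np.array([[5.0]])
preds = []
# The brute-force route to uncertainty: refit the whole model on many bootstrap resamples.
for _ in range(200):
    idx = rng.integers(0, len(y), size=len(y))
    fit = RandomForestRegressor(n_estimators=50, random_state=0).fit(X[idx], y[idx])
    preds.append(fit.predict(x_new)[0])

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"prediction at x=5: {np.mean(preds):.2f}, approximate 95% interval: ({lo:.2f}, {hi:.2f})")
```

Two hundred refits for a single interval: you can see why a model that carries its own probability statements is attractive.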

Gaussian processes

These are a class of supervised learning models that use probability but are highly flexible. Taking their starting point from signal processing ideas of the mid-20th century, they envisage the variation in the data as arising from a random process. Between observations, the process moves according to some probability distribution, and Gaussian processes use the normal, or Gaussian, distribution for this, just like Brownian motion. There are also Dirichlet processes for categorical variables.

This requires an interesting inversion in how we conceive of the estimation and inference problem. Instead of a small number of parameters deterministically leading to a large number of predictions, the data and predictions together are recast as a vector (or higher-dimensional tensor) of values governed by the parameters of the underlying process.

The flexibility of Gaussian processes has seen them widely applied to time series and spatio-temporal problems in recent years. For autocorrelation over more than one variable, they are faster and far more robust than the conditionally autoregressive (CAR) models of yesteryear. However, this flexibility comes at a cost, and requires considerable expertise, which remains in extremely short supply. Gaussian processes are mentioned in many data science and ML courses, but usually only at a high level. Their successful implementation requires knowing how to choose among the esoteric multi-dimensional priors involved, and how to tailor these to new settings. Statistical skills, in short!
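
For readers who like to see the idea in code, here is a minimal sketch using scikit-learn's GaussianProcessRegressor on made-up data (a convenient stand-in for heavier-duty tools): the fitted process returns a predictive mean and a standard deviation at every new point, so uncertainty comes built in.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Hypothetical noisy observations of a smooth underlying function.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=40)

# The kernel encodes prior beliefs about smoothness and noise:
# this is where the statistical judgement (and the esoteric priors) come in.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = np.linspace(0, 10, 5).reshape(-1, 1)
mean, sd = gp.predict(X_new, return_std=True)  # predictions with uncertainty attached
for x, m, s in zip(X_new[:, 0], mean, sd):
    print(f"x = {x:4.1f}:  prediction {m:+.2f}  +/- {1.96 * s:.2f}")
```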

Bayesian tuning parameter optimisation

There is no free lunch, in statistics as in ML. If you want to input x1, x2, x3 and so on, and get a prediction of y, you will need to constrain the space of possible models so that there is a manageable set of parameters to estimate. Non-parametric models are sometimes framed as infinite-dimensional, but the essence is the same: choose your method and let the computer find the unknown parameters that can be combined to give an output. This applies to supervised and unsupervised learning equally, because we haven't said anything about the output y being in the dataset along with the x's.

Arising from computer science, many ML methods do not explicitly state a probabilistic model with parameters, but instead try to efficiently and flexibly find a way of converting x's into y. To get that right, there are tuning parameters that need to be set. (Often in ML, they are called hyperparameters, but this has a different meaning in Bayesian statistics, so I will try to avoid ambiguity.) This is generally done by human-in-the-loop experimentation, and more complex (hence flexible) procedures like neural networks can involve many such tuning parameters, which interact with one another. This gives rise to a notion of tuning the algorithm as art rather than science, but recently, there has been much interest in making the tuning part of the process more robust. You can see the problem, for the boss: the entire ML procedure that they have staked their reputation on depends on one geek who knows which buttons to push. That sort of situation keeps bosses awake at night.

Bayesian methods are the buzzword in tuning parameter optimisation. This holds out the promise of even including uncertainty about the tuning parameters in the final inference, although that is not what most ML people mean by Bayesian optimisation. Bayesian sampling algorithms are all about efficiently exploring parameter space to return the distribution of parameter values that fit the data. You can also apply this to explore tuning parameter space, to return the tuning parameters that lead to the best fit. The main difference is that only a few "observations" of tuning parameter values have been tested; it is not a Big Data setting. In fact, it is generally our new friend, the Gaussian process, that is used as the model over the tuning parameters.
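
As a sketch of what this looks like in practice (assuming the scikit-optimize package and its gp_minimize function, which is just one implementation among many, on simulated data): a Gaussian process is fitted to the handful of tuning-parameter settings tried so far, and proposes the next setting to evaluate.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score
from skopt import gp_minimize  # assumes scikit-optimize is installed

X, y = make_regression(n_samples=400, n_features=10, noise=10.0, random_state=0)

def objective(params):
    """Cross-validated loss for one tuning-parameter setting (one 'observation')."""
    learning_rate, max_depth = params
    model = GradientBoostingRegressor(learning_rate=learning_rate,
                                      max_depth=int(max_depth), random_state=0)
    # gp_minimize minimises, so return the negative R^2.
    return -cross_val_score(model, X, y, cv=3).mean()

# A Gaussian process models the objective over the settings tried so far
# and chooses where to look next.
result = gp_minimize(objective,
                     dimensions=[(1e-3, 0.3, "log-uniform"),  # learning_rate
                                 (2, 8)],                     # max_depth
                     n_calls=25, random_state=0)

print("best tuning parameters:", result.x, " best CV loss:", result.fun)
```

The "data" here are just 25 evaluations of the objective: small-data modelling, with priors and uncertainty doing the heavy lifting.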

The ability to automate another part of the ML pipeline means some ML engineering jobs will be lost, but, as always, there is a need for new skills, and those happen to be Bayesian sampling algorithms, probability, uncertainty and modelling small numbers of observations. If only there were a name for this new discipline...

Bayesian updating and windowing

Big Data might have passed the peak of the hype curve, but it remains alluring to many organisations and essential in some settings. Streaming data is a special case, where new data arrive too fast to be included by traditional means. The Covid-19 pandemic has thrown this into focus for many analysts. Epidemiological data, supply chains and humanitarian aid have all had to be analysed urgently, against a rapidly changing problem, as data emerge.

The ability to update models rapidly as new data arrive will remain important for some time. That means analysing a new batch of data without having to re-run calculations on old data. In the streaming data paradigm, we might add the new data to a sliding window of time, and remove the oldest data from the window.

Here, Bayesian methods can help, by breaking down the posterior density of the parameters, for which we want estimates and uncertainty, into the product of the prior distribution and a sequence of likelihood functions for each batch.
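
As a deliberately simple toy illustration (a one-parameter conjugate case on made-up data, not the harder general problem discussed next): a Beta prior on a proportion can be updated batch by batch, with each batch's posterior becoming the next batch's prior, so the old data never need to be revisited.

```python
import numpy as np

# Hypothetical streaming data: five batches of binary outcomes with true rate 0.3.
rng = np.random.default_rng(7)
batches = [rng.binomial(1, 0.3, size=200) for _ in range(5)]

# Beta(a, b) prior on the rate; conjugacy means each batch just adds counts.
a, b = 1.0, 1.0
for i, batch in enumerate(batches, start=1):
    a += batch.sum()               # successes in this batch
    b += len(batch) - batch.sum()  # failures in this batch
    mean = a / (a + b)
    sd = np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    print(f"after batch {i}: posterior mean {mean:.3f} (sd {sd:.3f})")
```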

Although this is alluded to in almost every introductory Bayesian class, it is rarely done because of the difficulty of defining a joint distribution for all unknowns together, and defining its hyperparameters from the posterior sample of the previous batch. Also, the success of the method is contingent on choosing a "right" form of joint distribution, and if you change your mind about it, you must return to the beginning again.

So, in practice, non-parametric updating methods are needed. These do not define a mathematical formula for the prior, but instead use the previous batch's posterior sample to estimate densities and gradients at any new values. This is a topic I have been actively involved in, and which shows promise for Big Data and streaming data settings. However, care has to be taken over the tuning parameters (those little devils again!) and data drift, requiring intensive expert input. It will continue to be a hot topic too, as the volume and velocity of data are only set to grow further.

Explainable AI

Parents everywhere warn their kids that some rowdy game is "fun until someone loses an eye". I have probably said it myself but was too tired at the time to remember. Fitting some complex predictive model to your data, launching it to the public and calling it "AI" is like that too. You're heading for promotion until it gets your employers bad PR because it is inadvertently biased against some group of people. There are many stories of this, and they are often painful to read.

The boss wants an assurance that your AI is "transparent" or "explainable". In other words, they want some best-practice procedure in place that they can cite should the worst happen. And as the boss is the one signing off the software purchase, you can be sure that ML vendors are climbing over each other to add explainability.

There are many ways to do this, which often involve either fitting simpler, interpretable models to the predictions in the locality of the point of interest, or comparing partly matched observations from the dataset. As an output, you can obtain measures of the importance of different "features" or input variables, and an idea of what predictions would be obtained by perturbing the data. Simpler models, matching, variable importance measures, goodness-of-fit stats, smaller data... I suspect you can see where this is going.
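
To make one such output concrete, here is a minimal sketch using scikit-learn's permutation importance on simulated data (one tool among many): each input variable is shuffled in turn, and the drop in accuracy shows how heavily the model leans on it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical data with a handful of genuinely informative features.
X, y = make_classification(n_samples=600, n_features=8, n_informative=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Shuffle each feature in the test set and measure how much accuracy drops:
# a large drop means the model relies heavily on that variable.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: importance {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```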

Conclusion

Statistics and ML have often been presented as disparate skill sets, with adherents at loggerheads over which is best. This sort of story-telling gets clicks, but is a gross misrepresentation. Statistics and machine learning are two names for the same thing; they represent two paths by which people have arrived at data analysis: from mathematics or from computer science.

There are some cultural differences, just as Leo Breiman observed in his 2001 paper "Statistical Modeling: The Two Cultures", but there is increasing cross-fertilisation of ideas and influences as time goes by. As there is a premium for both groups on finding new, effective methods for new problems, we might reasonably expect further convergence.

Once, it was a time of Venn diagrams. Stats and computer science were represented as essential components of data science (whatever that is), along with domain knowledge. There were many variants of these diagrams, increasingly complex and eagerly shared or stolen by bloggers everywhere. A big mistake was introduced when the intersection of the three components was changed from data science to data scientist. These diagrams underwent their own hype curve, inspiring and then disappointing bosses everywhere when they failed to recruit the Unicorn at the centre. But maybe we should think again about the intersection of these three influences. ML has been through a boom time, so to make the next set of advances, wouldn't we expect to see stats and domain knowledge catching up in innovation and in the job market?

This article was written by medical statistician and trainer Robert Grant.

Visit Robert Grant's website.
