Machine Learning failure in algorithms, data, models and context.

Memo Akten
May 6, 2016


DARPA challenge fails. From https://www.youtube.com/watch?v=g0TaYhjpOfo

I’d like to talk about ‘algorithm failure’, and especially bias, in the context of machine learning. Specifically, I’d like to talk about the language and the potential points of failure, in the hope that this will widen the conversation to include those who are crucial to it — the people who implement and deploy machine learning systems.

There are a few distinct possible points of failure in a machine learning system. Referring to them all collectively as ‘algorithm failure’ has its place (e.g. when discussing the impact of the end result of the failure). However, I think it is a semantic simplification that obfuscates the cause of the failure — i.e. at which point(s) the failure(s) occurred — and thus risks alienating an audience who could be essential in preventing such failures in the future. So I’d like to expand the terminology a bit.

Background Theory

Almost all of the recent stories in the press regarding AI — whether it’s machine translation, Siri, financial forecasts, deepdream, computers imitating artists’ styles or AlphaGo beating Lee Se-dol at Go — are driven by the same (or a closely related family of) learning algorithms, with different bells and whistles to suit each case. It’s well documented now, even in the mainstream media, that on their own these learning algorithms cannot do anything. The learning algorithms themselves are simply abstract mathematical (primarily statistical) formulations. They require examples (i.e. training data) to give shape to how they will eventually make predictions. It’s only once the learning algorithms have been trained on such data that they are able to convert those abstract mathematical formulations into concrete decisions and predictions.
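To make this concrete, here is a minimal sketch using scikit-learn (my choice of library, with data invented purely for illustration) of the same learning algorithm trained on two different sets of examples, producing two models that make opposite predictions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two tiny, deliberately contradictory training sets: same inputs, opposite labels.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y_a = np.array([0, 0, 1, 1])  # dataset A: larger values belong to class 1
y_b = np.array([1, 1, 0, 0])  # dataset B: larger values belong to class 0

# The same learning algorithm, shaped by different examples, gives different models.
model_a = LogisticRegression().fit(X, y_a)
model_b = LogisticRegression().fit(X, y_b)

print(model_a.predict([[2.5]]))  # expected: class 1
print(model_b.predict([[2.5]]))  # expected: class 0
```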

However, this is still not enough to understand the wider implications of such data-driven systems. Without the full context, trying to understand the impact of an algorithm is like trying to understand the impact of the command ‘turn right’ — which is meaningless without knowing your current position, which direction you’re facing, how you’re moving (if at all), what’s around you, and last but not least, what you are trying to achieve or avoid.

For deep artificial neural networks (i.e. the cases mentioned above and most of the current cutting-edge developments) this comes in a few closely related, yet separate, components [1]:

Training (Learning) phase:

  1. We require examples, i.e. training data. This will determine all of the other factors. E.g. the type of data (image vs sound vs geolocation vs personal data vs sensor data etc) will heavily influence the architecture and possibly the learning algorithm that we decide to use. Even the amount of training data will affect the decisions we make in how to architect the system.
  2. Optionally, we may preprocess the training examples, or even manually design processing pipelines to extract specific features before training. (This is slightly less relevant in deep learning. In fact, one of the motivations behind deep learning is to skip this ‘hand-crafted feature engineering’ step and learn the features end to end.)
  3. Then we need to design an architecture that suits our data and our end goal. Without an architecture we have nothing to train.
  4. Then we run the training examples through the architecture using a learning algorithm. (For the sake of brevity, I’ll include everything related to ‘learning’ here, such as objective functions, optimisation, regularisation etc., even though each one of these is a whole topic in itself.)

The end result of this training session is called a model — because we’re trying to build a model of a particular system, to understand and predict its behaviour.
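To make steps 1-4 a little more concrete, here is a minimal sketch in PyTorch (my choice of library; the data is a random stand-in, invented purely for illustration) of training data, preprocessing, an architecture and a learning algorithm coming together to produce a model:

```python
import torch
import torch.nn as nn

# 1. Training data (here just random stand-in tensors: 256 examples, 8 features each,
#    with two classes, e.g. pedestrian / not pedestrian).
inputs = torch.randn(256, 8)
labels = torch.randint(0, 2, (256,))

# 2. Optional preprocessing, e.g. normalising the inputs.
inputs = (inputs - inputs.mean()) / inputs.std()

# 3. An architecture chosen to suit the data and the end goal.
architecture = nn.Sequential(
    nn.Linear(8, 64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# 4. A learning algorithm: an objective function and an optimiser, run over the examples.
objective = nn.CrossEntropyLoss()
optimiser = torch.optim.SGD(architecture.parameters(), lr=0.01)

for epoch in range(100):
    optimiser.zero_grad()
    loss = objective(architecture(inputs), labels)
    loss.backward()
    optimiser.step()

# The trained weights now sitting inside `architecture` are what we call the model.
model = architecture
```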

Deployment / Prediction phase:

5. Finally, we deploy and use the model. We feed the model new data, and the model tries to produce predictions or decisions (e.g. “that is a picture of a gorilla”, “the patient has cancer”, “steer left”, “arrest that man” etc).
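Continuing the sketch above, deployment is then just a matter of saving the trained model, loading it where it’s needed, and running it on new data (again a random stand-in here):

```python
# 5. Deployment: save the trained model, load it where it's needed, and feed it new data.
torch.save(model.state_dict(), "pedestrian_model.pt")

model.load_state_dict(torch.load("pedestrian_model.pt"))
model.eval()
with torch.no_grad():
    new_observation = torch.randn(1, 8)  # new data arriving at deployment time
    decision = model(new_observation).argmax(dim=1).item()
    print("pedestrian" if decision == 1 else "no pedestrian")
```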

For this model to work successfully, all of these steps and components need to be perfectly aligned and compatible with one another in the context of our end goal. Most crucially, each of the components may be individually successful in a slightly different context, but used together they may not be successful for a particular use case.

One could argue that this end result — the model — is in itself ‘the algorithm’ that makes the predictions or decisions. One could say ‘the algorithm failed to predict accurately’, meaning ‘the model failed to predict accurately’. And yes, one could say that. After all, we give ‘the algorithm’ (i.e. the model) an input, it follows a number of steps and operations and computes an output — by definition, that is an algorithm.

However, I find this semantic simplification likely to cause confusion, because ‘the algorithm’ that is failing is not necessarily the learning algorithm itself (though it could be). It is the final trained model. And the model might fail for many reasons: not necessarily because of the learning algorithm (used to train the model with the specific training data and specific architecture), but because of the architecture, or the training data, or the way the training data was preprocessed, or the context in which the model was deployed.

All of these components need to be perfectly suited to each other to get the best results (i.e. a model that predicts accurately or makes desirable decisions). However, ultimately each of these is a separate, independent component that can be plugged in from different problems or even different domains. Most critically, all of these components, which need to be perfectly compatible, might have been designed or sourced by different people for different tasks. And while they may be perfectly suited to the task they were originally designed for, they may not be perfectly suited to the task they are currently being used on. And that’s where the points of failure, and the related responsibilities, become very important.

For Example

Sarah wants to detect pedestrians in CCTV footage. She downloads a database of CCTV footage she finds online, collated by Nicole. She uses an architecture she finds in a paper, designed by Tom, that is very successful at detecting cats. She uses a learning algorithm she finds in another paper, designed by Adam for these kinds of problems (e.g. Tom also used this learning algorithm to train his model to detect cats).

Sarah takes all of these, tweaks them a little to suit her problem, and trains a model. I.e. she creates a model to detect pedestrians in CCTV footage using Nicole’s training data, Tom’s architecture and Adam’s learning algorithm, all with tweaks to suit her problem.
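In code, this combining and tweaking might look something like the sketch below. To be clear, this is a hedged illustration, not Sarah’s actual pipeline: the dataset folder is hypothetical, torchvision is my choice of library, and ResNet-18 and the Adam optimiser merely stand in for ‘Tom’s architecture’ and ‘Adam’s learning algorithm’.

```python
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

# Nicole's role: training data collected by someone else, downloaded and reused.
# "cctv_frames/" is a hypothetical folder with one sub-folder per class.
preprocess = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
data = datasets.ImageFolder("cctv_frames/", transform=preprocess)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

# Tom's role: a published architecture, tweaked (final layer replaced) for the new task.
architecture = models.resnet18(weights=None)
architecture.fc = nn.Linear(architecture.fc.in_features, 2)  # pedestrian / not pedestrian

# Adam's role: a published learning algorithm, reused more or less with its default settings.
objective = nn.CrossEntropyLoss()
optimiser = torch.optim.Adam(architecture.parameters(), lr=1e-4)

# Sarah's role: bringing it all together and training the model.
for images, labels in loader:
    optimiser.zero_grad()
    objective(architecture(images), labels).backward()
    optimiser.step()
```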

She then uses this model to detect pedestrians on her own CCTV footage, and that’s the context in which she deploys the model.

If this model fails (i.e. it fails to detect a pedestrian, or incorrectly detects a pedestrian where there isn’t one), I think we can all agree who’s at fault. It’s not Nicole — who independently collected CCTV footage. It’s not Tom — who designed an architecture independently of this particular problem and published an architecture very successful for what it was designed for (detecting cats). Similarly it’s not Adam — who published a learning algorithm that was successful in the tasks that he designed it for.

Sarah is the one who brought everything together, trained the model, and deployed it. It’s her responsibility to make sure the model works, and all of the individual components — training data, architecture, learning algorithm etc. — suit her problem [2].

This is not an unrealistic, hypothetical example. It is not uncommon to use architectures or learning algorithms found in papers published by AI researchers who dedicate years to trying out and testing different architectures and learning algorithms for different types of problems. It’s also not uncommon to use datasets found online to train our models. So it’s totally understandable that Sarah would try to create her model using Nicole’s training data, Tom’s architecture and Adam’s learning algorithm. But she has to take on the responsibility of making sure that she has enough of an understanding of each component to make them fit her problem perfectly [3].

But imagine this: Sarah’s model does work perfectly. She does a great job using Nicole’s training data, Tom’s architecture and Adam’s learning algorithm, and trains a model that correctly finds pedestrians in all of her tests and use cases. She runs the model for many years and it works perfectly. This is a very successful model and everyone is happy!

So she puts this model online. Not necessarily as a service for others to use, but she’s sharing her research.

Years later Peter also needs a system to find pedestrians in his CCTV footage. He reads Sarah’s paper, looks at her use case and decides that it’s very similar to his. So he downloads Sarah’s code and pre-trained model and tests it with his footage. He tests it for months, and it works perfectly for him too. Great, so he deploys it. But then it fails catastrophically. Why?

There could be many reasons, but let’s pick one single simple reason why it may fail. Let’s assume it’s because his CCTV footage is interlaced, whereas Sarah’s (and Nicole’s) footage is not (i.e. it’s non-interlaced).

It turns out that when Peter tested Sarah’s model for those few months, the pedestrians were always walking slowly, so the interlacing artefacts weren’t significant and Sarah’s model worked for him. But when the model encountered a fast-running pedestrian for the first time, many months later, the interlacing artefacts became significant and the detection failed. In Sarah’s use case, she ran the model for many years and it worked fine, even successfully detecting fast-running pedestrians, because her footage is non-interlaced, and that’s what she trained it on (i.e. Nicole’s data). But the same model didn’t work for Peter, even though his task is identical and his data is very similar, just not similar enough.

So the problem is not in Sarah’s model itself. It’s in the context in which Peter deployed the model.

Is Sarah at fault for not having made her model compatible with interlaced footage? If she provided this model as a commercial service, a ‘be-all and end-all of pedestrian recognition’, then yes, she should have made it compatible with interlaced footage. But if she released the model simply as research which works for her use case, and especially with a typical open-source license: ‘THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED’, then she should not be expected to cover all use cases — and this is how most research is shared these days. However, she should probably make a very clear note of ‘trained with and only tested with non-interlaced footage’. But then, in this day and age who still uses interlaced footage? It’s perhaps not unreasonable that this didn’t cross her mind [4].

But I’m only giving the interlaced vs non-interlaced property of the data as an easy-to-understand hypothetical example. It’s actually very easy to immediately see the difference between interlaced and non-interlaced footage, so if this were the case Sarah or Peter would have noticed it. However, there are many properties of data that are not so immediately obvious, and can cause similar failures. E.g. continuing with this pedestrian detection example, one can immediately think of real-world examples, including race, gender, size, shape or even types of clothing (which could be related to culture or economic / social status), as important factors in failure. And most importantly, these are factors in how the model fails in the context in which it’s being deployed. E.g. a model trained for and very successful in New York might not work well in Kabul. Or even a model trained for and very successful in most parts of London might not work in Stamford Hill (an area of London home to Hasidic Jews).
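One way to surface this kind of failure (sketched below) is to evaluate the model separately per group, rather than relying on a single overall accuracy score. The metadata fields here are hypothetical, and real footage rarely comes with such labels, which is part of the problem.

```python
from collections import defaultdict

def accuracy_by_group(examples, model, group_key):
    """examples: a list of dicts with 'features', 'label' and some (hypothetical) metadata
    such as 'location' or 'clothing_style'; model: a callable returning a predicted label."""
    correct, total = defaultdict(int), defaultdict(int)
    for example in examples:
        group = example[group_key]
        total[group] += 1
        if model(example["features"]) == example["label"]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}

# e.g. accuracy_by_group(test_examples, pedestrian_model, "location") might reveal
# a high score for footage from one city and a much lower score for another.
```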

Ultimately it’s Peter’s responsibility — as the person deploying the model — to make sure that the model he decides to use fits his problem and purpose.

But this is complicated by the fact that he didn’t actually train the model. He might not even have access to the training data. He saw that it successfully worked for Sarah for many years. He also saw that his problem appears to be almost identical to Sarah’s (this is the mistake he ultimately makes). He also saw that it worked successfully for him after months of testing. It is ultimately a ‘user error’ on Peter’s part. But when he doesn’t even have access to the training data, the architecture, or the learning algorithm used, and he’s so far removed from the creation of the model (he might be the 5th person in the loop, after Nicole, Adam, Tom, Sarah) how could he have predicted this failure?

Closing Thoughts

These are just some of the complications that might (and are quite likely to) occur and cause a model to fail. There’s a whole other conversation related to the question of what it even means for a model to be ‘successful’, i.e. one that ‘predicts correctly’ or makes ‘desirable decisions’. Then we can ask: relative to whom are these predictions or decisions ‘correct’ or ‘desirable’? This gets much muddier when we factor in capitalism and profit-motivated ‘AI-as-a-service’ business models. In this article I purposefully wanted to focus just on what it means for a model to fail, and to expand on the potential causes. I leave the discussion around what it means for a model to be successful (and relative to whom?) for another time.

Notes

  1. All of these steps aren’t always necessary or applicable in this way. E.g. in online learning or reinforcement learning, there isn’t always a separate ‘training’ vs ‘prediction’ phase; the model constantly learns as it gathers new data, simultaneously predicting and learning. And quite probably in the future, as more general-purpose learning algorithms are developed, different architectures might be unified into a single universal learning architecture. Or maybe the architecture itself will be learnt as part of the training. Or perhaps the whole current paradigm of deep artificial neural networks might be scrapped and an alternative approach, such as Bayesian methods or hierarchical temporal memory, might be adopted with a single unified architecture and universal learning algorithm. But even if it does all drastically change in the future, the distinction of [data] + [learning algorithm] -> [model to make predictions] will likely remain.
  2. Of course if Nicole was tasked with providing training data specifically for this problem, and if the model failed because the data was inadequate — e.g. not enough variety, too much bias — then Nicole is at fault. Likewise if Tom was tasked with designing the architecture specifically for this problem, and the model failed because of unsuitable architecture — e.g. failed to learn necessary features — then Tom is at fault. Likewise if Adam was tasked with designing the learning algorithm specifically for this problem, and if the model failed because of an unsuitable learning algorithm — e.g. it over-fit or failed to converge — then Adam is at fault.
  3. It’s worth pointing out that there aren’t a fixed X number of architectures and Y number of learning algorithms that one can use, each one suited to a particular problem. Instead there are an almost infinite number of variations, adaptations, bells & whistles that can be tacked onto architectures and learning algorithms to make them work for your problem. The current state is almost like ‘designing a song’. While there are many different distinct structures and instrumentations one can start from to design a song, there are an infinite number of variations that one can add, mix and match.
  4. If it had occurred to Sarah or Peter to check for interlaced vs non-interlaced footage, ultimately ‘fixing’ the model to accommodate interlaced footage would have been very simple, with a few different options. E.g. Peter pre-processes the footage to make it non-interlaced and uses the model as is, or Sarah adds interlaced footage to the training data and/or modifies the architecture to accommodate it.
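For example, the first option (Peter deinterlacing his footage before feeding it to Sarah’s model) might look something like this minimal sketch. OpenCV and NumPy are my choices here, not something specified above.

```python
import cv2
import numpy as np

def deinterlace(frame: np.ndarray) -> np.ndarray:
    """Keep only one field (every other scan line) of an interlaced frame, then resize
    it back to the original height, discarding the combing artefacts on fast motion."""
    single_field = frame[::2]  # every other row = one field
    return cv2.resize(single_field, (frame.shape[1], frame.shape[0]),
                      interpolation=cv2.INTER_LINEAR)

# e.g. clean_frame = deinterlace(cv2.imread("interlaced_frame.png"))
# ... and only then feed clean_frame to the pedestrian detection model.
```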
