# Optimising for beauty - the erasure of heterogeneity and an appeal for a Bayesian world view

--

*Part of a series of articles I’ll be posting over the next few days & weeks regarding thoughts from the past few years I’m only just getting round to writing about.*

*I believe many of the situations we face are generally of vastly complex nature and high dimensions, such that it is impossible for us to be able to see any situation in its entirety. All that we are able to perceive at any given time, is perhaps a tiny slice from a very specific perspective. And the best we can hope for, is to attempt to widen that slice by a tiny amount, and maybe even occasionally slightly shift our perspective. Unfortunately, contrary to popular belief, it’s not always being exposed to new ‘facts’ that helps us shift perspective — or at least, it’s definitely not enough. Instead we seem to need some kind of mental gymnastics to navigate to a new location inside our head, and gain these new view points. These mental gymnastics can often benefit from external nudging, and mental frameworks that act as structures around which we can build our thoughts. Traditionally this has been aided by natural language, evolved over thousands of years — but this isn’t always sufficient; and art for example is also critical to this end. I believe borrowing from other, more recent languages — such as computation — might also contribute to these mental frameworks.*

*In the spirit of the subject matter, doesn’t have to be read in any particular order:*

# Short version

An artificial neural network dreams up new faces whilst it’s training on a well-known dataset of thousands of celebrities. Every face seen here is fictional, imagined by the neural network based on what it’s seeing and learning. But what is it learning?

The politics of this dataset, who’s in it, who isn’t in it, how it’s collected, how it’s used and the consequences of this, is in itself a crucial topic, but not the subject of this work.

Indeed the dataset may not be representative of the diversity of the wider population. However, the learning algorithm itself provides a further layer of homogenisation. In these images produced by the neural network, whatever diversity was present in the dataset is mostly lost. All variety in shape, detail, texture, individuality and blemishes are all erased. They are smoothed out with the most common attributes dominating the results.

The network is learning an idealised sense of hyper-real beauty, a race of ‘perfect’, homogeneous specimens.

This is not a behaviour that is explicitly programmed in, it is an inherent property of the learning algorithm used to train the neural network. And while there are recent algorithms which specialise in generating more photo-realistic images (especially faces), the algorithm used here is one of the most widespread algorithms used in Machine Learning and Statistical Inference — Maximum Likelihood Estimation (MLE).

However, MLE is unable to deal with uncertainty. It’s unable to deal with the possibility that a less likely, a less common hypothesis might actually be valid.

MLE is binary. It has no room for alternative hypotheses. Instead, the hypothesis with the highest apparent likelihood is assumed to be unequivocally true.

MLE commits to this dominant truth, and everything else is irrelevant, incorrect and ignored. Any variations, outliers, or blemishes are erased; they’re blurred and bent to conform to this one dominant truth, this binary world view.

Perhaps not unlike the increasingly polarising and binary discourse in our increasingly divided societies.

# Long version

An artificial neural network dreams up new faces whilst it’s training on a well-known dataset of thousands of celebrities. Every face seen here is fictional, imagined by the neural network based on what it’s seeing and learning. But what is it learning?

The politics of this dataset, who’s in it, who isn’t in it, how it’s collected, how it’s used and the consequences of this, is in itself a crucial topic, but not the subject of this work.

When I first started seeing these kinds of results, I was fascinated by how everything was so blurry and smoothed out. Indeed the dataset may not be representative of the diversity of the wider population. However, the learning algorithm itself provides a further layer of homogenisation. In these images produced by the neural network, whatever diversity was present in the dataset is mostly lost. All variety in shape, detail, texture, individuality and blemishes are all erased. They are smoothed out with the most common attributes dominating the results.

The network is learning an idealised sense of hyper-real beauty, a race of ‘perfect’, homogeneous specimens.

This is not a behaviour that is explicitly programmed in, it is an inherent property of the learning algorithm used to train the neural network. And while there are recent algorithms which specialise in generating more photo-realistic images (especially faces), the algorithm used here is one of the most widespread algorithms used in Machine Learning and Statistical Inference — Maximum Likelihood Estimation (MLE).

That means — quite intuitively speaking — given a set of observations (i.e. data points), out of all possible hypotheses, find the hypothesis that has the *maximum likelihood *of giving rise to those observations. Or in fewer words: given some data, *find the hypothesis which is most likely to have produced that data*.

It makes a lot of sense right?

But it has some shortcomings.

Imagine we find a coin on the street. And we’d like to know whether it’s a fair coin, or weighted (i.e. ‘biased’) towards one side, possibly favoring heads or tails. We have no way of determining this by physically examining the coin itself. So instead, we decide to conduct an experiment. We flip the coin ten times. And let’s say we get 7 heads and 3 tails.

A maximum likelihood approach would lead one to conclude that the most likely hypothesis that can give rise to these observations, is that the coin is biased in favor of heads. In fact, such a ‘frequentist’ might conclude that the coin is biased 7:3 in favor of heads (with a 26.7% probability of throwing 7 heads). An extreme frequentist might even commit to that hypothesis, unable to consider the possibility that actually there are many other alternative hypotheses, one of which — though less likely — might actually be the correct one. E.g. it’s very possible that actually the coin is indeed a fair coin, and simply by chance we threw 7 heads (for a fair coin, the probability of this is 11.7%, not negligible at all). Or maybe, the coin is biased only 6:4 in favor of heads (probability of throwing 7 heads is then 21.5%, still very likely). In fact the coin might have any kind of bias*,* *even in favor of tails. *In this case it is very unlikely that we would have thrown 7 heads, but it’s not *impossible.*

Some of you might be critical of this example, citing that 10 throws is far too small a sample size to infer the fairness of the coin. “Throw it 100 times, or 1000 times” you might say. Heck, let’s throw it a million times.

If we were to get 700,000 heads out of 1 million throws, *then* can we be *sure* that the coin is biased 7:3 in favor of heads?

For practical reasons, we might be inclined to assume so. We have indeed quite radically increased our *confidence* in that particular hypothesis. But again, that does not eradicate the possibility that another hypothesis might actually be the *correct* one.

Evidence is not *proof*. It is merely *evidence*. I.e. a set of observations that increase or decrease the likelihood of — and our confidence in — some hypotheses relative to others.

MLE is unable to deal with uncertainty. It’s unable to deal with the possibility that a less likely, a less common hypothesis might actually be valid.

MLE is binary. It has no room for alternative hypotheses. Instead, the hypothesis with the highest apparent likelihood is assumed to be unequivocally true.

MLE commits to this dominant truth, and everything else is irrelevant, incorrect and ignored. Any variations, outliers, or blemishes are erased; they’re blurred and bent to conform to this one dominant truth, this binary world view.

So are there alternatives?

At the opposite end of the spectrum, one could maintain a distribution of beliefs over *all possible hypotheses*, with varying levels of confidence (or uncertainty) assigned to each hypothesis. These levels of confidence might range from “I’m really very certain of this” (e.g. the sun will rise again tomorrow from the east) to “I’m really very certain this is wrong” (e.g. I will let go of this book and it will stay floating in the air) — and everywhere in between. Applying this form of thinking to the coin example, we would not assume absolute truth in any single hypothesis regarding how the coin is biased. Instead, we would calculate likelihoods for every possible scenario, based on our observations. If need be, we can also incorporate a prior belief into this distribution. E.g. since most coins are *not* biased, we’d be inclined to assume that this coin too is not biased, but nevertheless, as we make observations and gather new evidence, we will update this distribution, and adjust our confidences and uncertainties for every possible value of coin bias.

And most critically, when making decisions or predictions, we don’t act assuming a single hypothesis to be true, but instead we consider every possible outcome for every possible hypothesis of coin bias, weighted by the likelihood of each hypothesis. (Those interested in this line of thinking can look up *Bayesian logic or statistics*).

But, as I’ve mentioned in my other works and texts, my main interest is not necessarily these algorithms themselves. In a real world situation, no fool would settle on a maximum likelihood solution with only 10 samples (one would hope!).

My main interest is in using machines that learn as a reflection on ourselves, and how we navigate our world, how we learn and 'understand', and ultimately how we make decisions and take actions.

I’m concerned we’re losing the ability to safely consider and parse multiple views within our social groups, let alone within our own minds. I’m concerned we’re losing the ability to recognize and navigate the complexities of situations which may require acknowledging — or even understanding — the existence of lines of reasoning that may lead to different views than our own; maybe even radically different, opposing views. I’m concerned that ignoring these opposing lines of reasoning, and pretending that they don’t exist, might be more damaging in the long run; compared to acknowledging their existence and trying to identify at what point(s) our lines of reasoning are diverging, and how, and why, and trying to tackle these more specific issues at point; whether they are due to what we consider to be incorrect premises, flawed logic, differing priorities or desires.

I’m concerned we’re becoming more and more polarized and divided on so many different points on so many different topics. Views, opinions and discourse in general seems to be becoming more binary, with no room to consider multiple or opposing views. I’m concerned we want to ignore the messy complexities of our ugly, entangled world; to erase the blemishes; and commit unequivocally to what seems (to us) to be the most likely truth — the one absolute truth, in a simple, black and white world of unquestionable rights and wrongs.

Or maybe that’s just my over-simplified view of it.

# Acknowledgements

Made using the ‘CelebA’ dataset and Variational Autoencoders

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang, “Deep learning face attributes in the wild,” *Proceedings of the IEEE International Conference on Computer Vision* **2015 Inter**, 3730–3738 (2015).

Diederik P Kingma and Max Welling, “Auto-Encoding Variational Bayes,” *arXiv preprint arXiv:1312.6114*, (2013).