Machine learning falls into two broad categories: supervised and unsupervised learning. Even in the wake of many recent successes of unsupervised AI, supervised model-driven AI is still highly relevant. 

While supervised learning starts with manually labeled examples of the desired outputs, unsupervised learning starts from the unlabeled data itself, making its own inferences about the structure it finds in that data.

Although recent years have witnessed a number of striking successes of unsupervised learning, particularly on clustering and representation learning tasks, it remains plagued by challenges, most notably the infamous black box – the lack of external insight into how and why an unsupervised system makes the particular inferences it does. Now that the European Union’s General Data Protection Regulation (GDPR) is in effect, organizations are under pressure to implement explainable and trustworthy AI.

Here, we take a closer look at the interplay between supervised and unsupervised learning, examining the benefits and drawbacks of each.

What’s so great about unsupervised learning?

The recent strides of unsupervised learning

As the costs of cloud-based computing continue to drop, innovations in automated data preparation have simplified many traditionally time-consuming aspects of data analysis, including labeling, sorting and classification.

For example, a 2016 report by Gartner estimated that poor data quality could cost an organization as much as $13.5 million each year. And according to a CrowdFlower survey of data scientists, a full 76 percent considered data preparation the most labor-intensive aspect of their work. Thus, smarter data processing means easier, faster, cheaper unsupervised analysis.

The Sherlock of pattern recognition

Unsupervised learning has already demonstrated its accuracy at identifying subtle structure in datasets where humans have difficulty seeing patterns. When working with datasets including many features, such as visual images and videos, unsupervised algorithms can rapidly classify and cluster data using far fewer features than humans might specify, making data processing even faster and more efficient.
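To make the clustering idea concrete, here is a minimal sketch of k-means, one common unsupervised algorithm, in plain Python. The article doesn't name a specific algorithm, so the choice of k-means, the function names `kmeans` and `squared_distance`, the farthest-point initialization and the toy points are all assumptions made for illustration. Note that no labels are supplied: the grouping comes from the data alone.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    """Group points into k clusters; returns (labels, centroids)."""
    # Deterministic start (an illustrative choice): take the first point,
    # then repeatedly add the point farthest from the centroids chosen so far.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )

    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy data: two loose groups of 2-D points, no labels given.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

On real image or video data the points would be high-dimensional feature vectors rather than hand-picked pairs, but the assign-then-update loop is the same.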

No overfitting for this big guy

What’s more, unsupervised machine learning circumvents several well-known downsides of supervised (model-driven) learning algorithms – including overfitting, one side of the bias-variance tradeoff, in which a model fitted too closely to a narrow training dataset shows higher error rates during the inference phase, because it hasn’t been properly trained to recognize rare and unexpected features.

But for all these benefits, unsupervised machine learning does have its downsides.

Key drawbacks of unsupervised learning 

Lack of accountability in artificial intelligence

At the basic design level, most machine learning algorithms are fairly easy to characterize. For example, a simple algorithm that produces output Z as a function of input Y and calculation X could be expressed as “Z = XY.”
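As a rough illustration of how easy the simple case is to characterize, the sketch below writes “Z = XY” as a function that returns both its output and a one-line account of how it got there. The function name `simple_algorithm` and the default value of `x` are hypothetical, chosen only for this example.

```python
def simple_algorithm(y, x=2.5):
    """Compute Z = X * Y and return the result with a readable trace."""
    z = x * y
    # The whole "model" is one multiplication, so the explanation is exact.
    explanation = f"Z = {x} * {y} = {z}"
    return z, explanation


z, explanation = simple_algorithm(4.0)
print(explanation)  # prints "Z = 2.5 * 4.0 = 10.0"
```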

But while Z and Y might be simple variables, challenges begin to arise when X is not a straightforward set of “if-then” calculations, but a nuanced deep learning architecture composed of millions of ever-changing relationships.

In such a deep learning architecture, the algorithm’s exact method of data analysis becomes far too complex to characterize with any precision.

This is partly because the algorithm’s analysis method itself evolves independently as the algorithm learns.

In other words, the more nuanced the algorithm’s analysis becomes, the harder it gets to describe how that analysis takes place – just as it’s easy for you to describe how you solve an arithmetic problem, but near-impossible to explain how you experience the beauty of a painting.

This is the core challenge of explainable AI – a major problem in today’s machine learning landscape. And it’s far more than just a technical challenge. As of 2018, the GDPR requires organizations to provide “fair and transparent processing” of consumer data, including the use of “appropriate” and explainable machine learning models to analyze that data. “Black box” systems just don’t cut it anymore.


Data lakes are more like data swamps

Unsupervised learning requires much larger datasets than supervised learning – and gathering that much data is not always possible for teams working with limited resources. While it’s well known that larger datasets drive more accurate analysis, the processes of gathering and cleaning this data are often costly and time-consuming, and data lakes – repositories that hold the raw data awaiting analysis – can be more like disorganized, expensive “data swamps.”

As data storage becomes more affordable, and approaches to data labeling become more standardized, existing data lakes can stop sitting around like moldy ponds, and start being channeled into the service of meaningful machine analysis.

Classification, categorization, problem solving: supervised algorithms are still kings of their realms

Counterintuitive as it may be, supervised algorithms (particularly logistic regression and random forest) tend to outperform unsupervised ones on discrete classification and categorization tasks, where data is relatively structured and well-labeled.
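For comparison with the unsupervised case, here is a minimal sketch of logistic regression, one of the supervised algorithms named above, trained by plain gradient descent. It assumes a tiny, well-labeled dataset with a single feature already centered around zero. The function names, learning rate, epoch count and data are all hypothetical; a production system would normally rely on an established library instead.

```python
import math


def sigmoid(t):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-t))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            # Prediction error for this labeled example.
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the average gradient.
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the predicted probability is at least 0.5, else 0."""
    return 1 if sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5 else 0


# Hypothetical labeled data: negative feature values are class 0, positive are class 1.
features = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
labels = [0, 0, 0, 1, 1, 1]
weights, bias = train_logistic_regression(features, labels)
print(predict(weights, bias, [2.5]), predict(weights, bias, [-2.5]))
```

Unlike a clustering algorithm, this model is told the right answer for every training example, which is why clean labels matter so much to its accuracy.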

Moreover, unsupervised learning has been shown to perform poorly when used for problem-solving, planning and decision-making tasks. These drawbacks make unsupervised learning inadequate for critical applications like self-driving cars and computer vision, where even small errors can be extremely costly, or even fatal.

Unsupervised and supervised learning united

Thus, even in the wake of many recent successes of unsupervised data-driven AI, supervised model-driven AI is still highly relevant. For researchers today, however, the primary challenge is not to develop supervised learning models that supersede their unsupervised counterparts, but to develop these two faces of AI in tandem.

When the benefits of supervised and unsupervised AI are leveraged in support of one another, we can begin to open the black box, and create regulation-compliant explainable AI for the next generation of applications, in a world where AI will be present in all areas of our lives.

Originally published on July 11, 2018

Topics: Machine Learning, Data Science, Datasets

