In contrast to supervised learning techniques, unsupervised learning techniques operate without known outputs or labels – that is, these techniques are not trying to predict any specific outcomes. Instead, unsupervised techniques attempt to uncover patterns within data sets. Unsupervised learning is a useful approach for problems that lack sufficient labeled output data to train a supervised model.
Unsupervised techniques include clustering algorithms that group data in meaningful ways. Clustering algorithms are used, for example, to identify and segment retail bank customers who are similar, or to identify similar sensor data feeds from equipment. Examples of clustering algorithms include k-means – a method to create subgroups of similar data points using “distance” between data points based on features – and Gaussian mixture models (GMM) – a method to identify subgroups of similar data points using statistical probability distributions.
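The k-means loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation (the data and function names here are invented for the example, and the initial centroids are supplied by hand rather than chosen randomly):

```python
import numpy as np

def kmeans(X, centroids, n_iters=100):
    """Minimal k-means sketch: assign each point to its nearest
    centroid, then move each centroid to the mean of its points."""
    centroids = centroids.copy()
    for _ in range(n_iters):
        # distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(len(centroids))
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return labels, centroids

# two well-separated synthetic "customer segments"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
               rng.normal(5, 0.5, size=(50, 2))])
init = np.array([[0.0, 0.0], [5.0, 5.0]])  # hand-picked starting centroids
labels, centroids = kmeans(X, init)
```

In practice, library implementations such as those in scikit-learn add refinements like smarter centroid initialization and multiple restarts, but the assign-then-update loop is the core of the algorithm.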
Consider a simple machine learning model, shown in the following figure, that aims to cluster data into four categories (blue, yellow, red, and green). Here, the algorithm was told to look for four categories, but the categories were not pre-defined or labeled in the training data.
This type of clustering model can be applied to the customer attrition prediction example discussed earlier – for example, to identify groups of similar customers. Although no outputs or labels are known, an analyst can review the clusters to understand buying behavior and identify outlier customers or groups of customers who may be at risk of attrition.
Unsupervised Learning: No Ground Truth
Figure 9 Unsupervised machine learning models do not require labels to train on past data. Instead, they automatically detect patterns in data to generate predictions. This example illustrates a clustering algorithm.
Another example of an unsupervised learning technique is dimensionality reduction. One of the central problems in machine learning is representing human-interpretable patterns in complex data. Advanced data science problems may involve large volumes of high-dimensional data, such as pixels in images, sensor measurements from equipment, or human gene data.
Dimensionality reduction is a powerful approach to construct a low-dimensional representation of high-dimensional input data. The purpose of dimensionality reduction is to reduce noise so that a model can identify strong signals among complex inputs – i.e., to identify useful information.
High dimensionality poses two challenges. First, it is hard for a person to conceptualize high-dimensional space, meaning that interpreting a model is non-intuitive. Second, algorithms have a hard time learning patterns when there are many input features relative to the number of available training examples – a problem often called the curse of dimensionality.
Examples of dimensionality reduction models include autoencoders, an artificial neural network approach that “encodes” a complex feature space to capture important signals, and principal component analysis (PCA), a statistical method that linearly combines a large number of input variables into a smaller, more informative set of features.
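PCA can be sketched compactly via the singular value decomposition: center the data, then project it onto the top right singular vectors. The data below is synthetic and chosen so that most of the variance lies along a single direction (the function name and setup are illustrative, not from a specific library):

```python
import numpy as np

def pca(X, n_components):
    """PCA sketch via SVD: center the data, then project onto the
    directions of greatest variance (top right singular vectors)."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = (S ** 2) / (S ** 2).sum()  # variance ratio per component
    return Xc @ Vt[:n_components].T, explained[:n_components]

# 3-D points lying near a 1-D line plus a little noise:
# one principal component should capture nearly all the variance
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 1))
X = t @ np.array([[1.0, 2.0, 3.0]]) + rng.normal(scale=0.01, size=(200, 3))
Z, ratio = pca(X, n_components=1)
```

Here the 200 three-dimensional points are compressed to a single coordinate each, while the `ratio` value confirms how much of the original variance that one component retains.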
One of the most useful – and significant – purposes of unsupervised machine learning is to perform anomaly detection. Anomaly detection is an approach to define normal behavior in a data set and identify patterns inconsistent with it. Using anomaly detection, it is possible to flag patterns – such as abnormal equipment behavior or a faulty sensor – that would otherwise require labeled historical data and a supervised model to identify. The following figure shows example output of an anomaly detection algorithm. Anomalies are highlighted as red circles.
Figure 10 Example of an unsupervised machine learning model for anomaly detection.
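A very simple form of anomaly detection can be sketched with z-scores: define "normal" by the mean and standard deviation of the data, then flag points that fall too far from the mean. This is only one basic approach (practical systems often use richer methods such as isolation forests or clustering-based detectors), and the threshold and data below are illustrative:

```python
import numpy as np

def zscore_anomalies(x, threshold=3.0):
    """Flag points more than `threshold` standard deviations
    from the mean of the series."""
    z = np.abs((x - x.mean()) / x.std())
    return z > threshold

# simulated sensor readings with one injected fault
rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=1000)
x[42] = 15.0  # an obviously abnormal reading
flags = zscore_anomalies(x)
```

The injected reading is flagged, while almost all of the normal readings are not. Note that the z-score approach assumes roughly normal, unimodal data; for multimodal data like the clustered example above, distance-to-nearest-centroid is a common alternative.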