Understanding Clustering for Machine Learning
Since we’re talking about Clustering for Machine Learning, let’s start by understanding what we mean by a cluster. A cluster is a group of objects that are more similar to one another than to objects in other groups.
Clustering is the process of forming such groups based on some predefined criteria. It is an unsupervised learning technique: there are no predefined classes and no examples or prior information demonstrating how the data should be grouped (or labeled) into separate classes.
Clustering could also be considered an Exploratory Data Analysis (EDA) process, which allows us to discover hidden patterns of interest or structure in the data. It can be used as a standalone tool to get insight into the data distribution or as a preprocessing step for other algorithms.
Types of Machine Learning Algorithms
 Supervised Learning Models: Algorithms where we train a model using class labels. That is, the classifications (or labels) of the sample dataset are well known and predetermined, and the model uses them to predict labels or classifications for test cases.
 Unsupervised Learning Models: We have little to no prior knowledge of the results or of how the data should be grouped. We need to examine the relationships in the data to determine an appropriate clustering. This requires exploring the dataset to discover the natural grouping (or cluster) to which each observation may belong.
 Semi-Supervised Learning Models: Here some of the observations in the dataset have labels while others do not.
| Supervised Learning ("Y" is Known) | Unsupervised Learning ("Y" is Unknown) | Semi-Supervised Learning (Sometimes we know "Y") |
| --- | --- | --- |
| Regression (Lasso, Ridge, Logistic) | Clustering (K-means clustering, Mean shift clustering, Spectral clustering) | Prediction & Classification |
| Decision Tree (Gradient Boosting & Random Forest) | Apriori Rule | Clustering |
| Neural Network | Kernel Density Estimation | Expectation-Maximization (EM) |
| Support Vector Machine (SVM) | Principal Component Analysis (PCA) (Kernel PCA, Sparse PCA) | Transductive Support Vector Machines (TSVM) |
| Naive Bayes | Singular Value Decomposition (SVD) | Manifold Regularization |
| K-Nearest Neighbor (KNN) | Self-Organizing Map (SOM) | Autoencoder (Multilayer Perceptron, Restricted Boltzmann Machine (RBM)) |
Simple Example of Clustering for Machine Learning
Assume that we have a dataset containing information about people in different occupations: their countries, ages, education, family sizes, the stores where they shop and the products they purchase in those stores.
We can easily group them based on their locations (countries, continents, etc.) or simply based on job titles or job categories (healthcare, engineering, professional, executive, etc.). However, such single-variable classifications are sometimes not sufficient. We might need to categorize the dataset based on multiple features (attributes, columns) and take advantage of more intrinsic behaviors or patterns that are not obvious from looking at the dataset.
This approach is called Clustering. An example in market segmentation might involve clustering customers based on their shopping behavior: where they shop, what they buy and the quantity of their purchases.
How Do We Define Good Clustering Algorithms?
A good clustering pursues two goals at once: reducing the distance between objects in the same cluster (intra-cluster minimization) and increasing the distance between different clusters (inter-cluster maximization).
Intra-cluster minimization – The closer two or more objects are to each other, the more likely they belong to the same cluster. In other words, within each cluster, the distance between any two objects is minimized.
Inter-cluster maximization – This establishes the independence/separation between any two clusters. The goal is to separate the two clusters as much as possible. Two possible ways to achieve this goal are: (1.) maximize the distance between the centers of the two clusters, (2.) maximize the distance between the closest points of the two clusters.
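To make these two goals concrete, here is a minimal sketch that measures them on two small, made-up 2-D clusters. The data and the helper names (`mean_intra_distance`, `center_distance`, `closest_point_distance`) are assumptions for illustration, and Euclidean distance is just one possible choice of metric.

```python
from itertools import combinations, product
from math import dist

# Two hypothetical 2-D clusters, made up for illustration.
cluster_a = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0)]
cluster_b = [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]


def mean_intra_distance(points):
    """Average pairwise distance between the objects of one cluster."""
    pairs = list(combinations(points, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def centroid(points):
    """Coordinate-wise mean of the points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def center_distance(a, b):
    """Option (1): distance between the two cluster centers."""
    return dist(centroid(a), centroid(b))


def closest_point_distance(a, b):
    """Option (2): distance between the closest pair of points across clusters."""
    return min(dist(p, q) for p, q in product(a, b))


intra_a = mean_intra_distance(cluster_a)
inter_centers = center_distance(cluster_a, cluster_b)
inter_closest = closest_point_distance(cluster_a, cluster_b)
print(intra_a, inter_centers, inter_closest)
```

For well-separated clusters like these, the intra-cluster value should come out much smaller than either inter-cluster value.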
There are several clustering algorithms and they all use different techniques to meet the clustering goal. They can be classified into partitioning algorithms and hierarchical algorithms.
Clustering (By Partitioning Algorithm) – We construct a certain number of partitions and then evaluate them based on some criteria. An example is the K-means algorithm: we do not know how many partitions really exist in the dataset, so we choose a number K and generate the partitions based on the intra-cluster distances, i.e. the distances between the objects in each cluster.
Clustering (By Hierarchical Algorithm) – We decompose the data into different sets of objects using predefined rules. This approach is particularly useful when we are not sure of the number of clusters/partitions to create. However, the approach can be annoyingly slow in a big data environment!
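As a rough illustration of the hierarchical idea, the sketch below uses one possible predefined rule: start with every point in its own cluster and repeatedly merge the two closest clusters (single linkage) until the closest pair is farther apart than a threshold. The function names, the `max_merge_distance` parameter and the sample points are all assumptions made for this example.

```python
from math import dist


def single_link(a, b):
    """Distance between the closest pair of points of two clusters."""
    return min(dist(p, q) for p in a for q in b)


def agglomerate(points, max_merge_distance):
    """Merge the two closest clusters until none are within the threshold."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > 1:
        # Find the closest pair of clusters. Every pair is compared on every
        # pass, which is why this naive version gets slow on large datasets.
        d, i, j = min(
            (single_link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_merge_distance:
            break  # the remaining clusters are too far apart to merge
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters


# Hypothetical points: two tight groups and one isolated point.
points = [(1.0, 1.0), (1.5, 1.0), (1.0, 1.6), (8.0, 8.0), (8.4, 8.2), (20.0, 1.0)]
clusters = agglomerate(points, max_merge_distance=2.0)
print(len(clusters))
for c in clusters:
    print(c)
```

Note that the number of clusters is never passed in; it falls out of the merge threshold, which is the property that makes this family useful when the number of partitions is unknown.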
Clustering Algorithms and Membership Information
Objects in clustering algorithms can have a Hard or Soft membership.
Hard clustering occurs when each observation belongs to exactly one cluster, while in soft clustering an observation can belong to more than one cluster, each with a certain degree of likelihood. Membership characteristics are very important in real applications because it can be very hard to draw sharp boundaries between different clusters.
For example, in an e-commerce application, we can place a pair of sneakers in 2 clusters: (1.) Sport Apparel and (2.) Casual Shoes.
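The sneakers example can be sketched in a few lines. Everything numeric here is hypothetical: the 2-D feature space, the two cluster centers and the product's feature vector are invented, and weighting by inverse distance is only one of many ways to turn distances into degrees of membership.

```python
from math import dist

# Hypothetical cluster centers in a made-up 2-D feature space.
centers = {"Sport Apparel": (1.0, 0.0), "Casual Shoes": (0.0, 1.0)}
sneakers = (0.6, 0.5)  # hypothetical feature vector for one product


def hard_membership(x, centers):
    """Hard clustering: the object belongs to exactly one cluster (the nearest)."""
    return min(centers, key=lambda name: dist(x, centers[name]))


def soft_membership(x, centers, eps=1e-12):
    """Soft clustering: one degree of membership per cluster, summing to 1.

    Weights are proportional to inverse distance, an illustrative choice.
    """
    weights = {name: 1.0 / (dist(x, c) + eps) for name, c in centers.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


hard = hard_membership(sneakers, centers)
soft = soft_membership(sneakers, centers)
print(hard)
print(soft)
```

The hard assignment forces a single answer, while the soft assignment keeps the information that the product sits fairly close to both clusters.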
Questions to Ask When Choosing Clustering Algorithms:
The choice of a clustering algorithm for machine learning is heavily influenced by the use case and the data that is available. Some questions that can help in selecting the best clustering algorithm are:
 Is the algorithm scalable?
 Does it handle different kinds of attributes? (categorical vs numerical)
 Is it compulsory to specify the number of clusters beforehand?
 How much control is required over the parameters and the output?
 How do we want to handle noise and outliers?
 Can the algorithm handle high-dimensional data?
 Are the results interpretable, and do they make sense when profiled?
Data preparation is also a very significant step in clustering. Real-world data is noisy and can be strongly skewed; it may have weak clustering structure and include bad outliers. The amount of variability in the data may also be a concern.
Effective ways to prepare data for clustering might involve the following steps:
 Variable transformation – changing the range of a variable (e.g. standardization)
 Changing the distribution of a variable (e.g. a log transform for skewed data)
 Variable selection or outlier removal
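The first two preparation steps can be sketched as follows. The columns `ages` and `spend` are hypothetical, and the choice of a z-score for standardization and `log(1 + x)` for reshaping a skewed variable are common options rather than the only ones.

```python
from math import log1p
from statistics import mean, pstdev

# Hypothetical columns: age in years and yearly spend (skewed by one large value).
ages = [23.0, 35.0, 41.0, 29.0, 52.0]
spend = [120.0, 300.0, 250.0, 180.0, 9000.0]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def log_transform(values):
    """Compress a right-skewed, non-negative variable with log(1 + x)."""
    return [log1p(v) for v in values]


ages_z = standardize(ages)
spend_z = standardize(log_transform(spend))
print([round(z, 2) for z in ages_z])
print([round(z, 2) for z in spend_z])
```

After this step both columns are on a comparable scale, so a distance-based algorithm is not dominated by whichever variable happens to have the largest raw numbers.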
Examples of Popular Clustering Algorithms

K-Means Algorithm
We are interested in minimizing the distance between each observation and the center of the cluster it is assigned to.
 1. The number of clusters is predetermined (e.g. 4 clusters). The initial cluster centers C can be selected randomly, with a heuristic, or with a smarter initialization technique.
 2. Assign each observation i to the cluster whose center C is nearest.
 3. After the assignment, recompute each of the 4 cluster centers C as the mean of the observations in its cluster, then reassign each observation i to its nearest new center.
 4. Repeat step 3 until a stopping criterion is satisfied (e.g. the distance between each cluster center, i.e. the mean, and its objects falls below a certain value).
Possible stopping (or terminating) conditions are: (1.) a certain number of iterations is reached, (2.) a certain number of observations no longer change clusters, i.e. those observations are near their optimal values, (3.) the cluster centers (centroids) C do not change their positions, (4.) the squared error is less than some small threshold value α.
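The procedure can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not a production implementation: initial centers are drawn at random from the data, Euclidean distance is used, only stopping conditions (1.) and (3.) are implemented, and the names `kmeans`, `max_iter`, `tol` and `seed` as well as the sample points are invented for the example.

```python
import random
from math import dist


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def nearest(x, centers):
    """Index of the center closest to x."""
    return min(range(len(centers)), key=lambda k: dist(x, centers[k]))


def kmeans(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers taken from the data
    labels = []
    for _ in range(max_iter):  # stopping condition (1.): iteration cap
        # Assign every observation to its nearest center.
        labels = [nearest(x, centers) for x in points]
        # Recompute each center as the mean of its cluster.
        new_centers = []
        for j in range(k):
            members = [x for x, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(mean_point(members) if members else centers[j])
        # Stopping condition (3.): the centers no longer move.
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers, labels


# Hypothetical data: two small, well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, k=2)
print(centers)
print(labels)
```

On less tidy data the result can depend on the initial centers, which is why heuristic or smarter initialization is often preferred over a purely random draw.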
K-Means Clustering Advantages
 Simple, understandable and efficient
 K-means is a partitioning algorithm, so objects are automatically assigned to clusters
 Can be used as a pre-clustering step, after which other algorithms can be applied on smaller subspaces
The Disadvantages
 Must select a specific number of clusters K before starting
 All items are forced into a cluster
 Sensitive to outliers and noise
 Does not work well with non-circular cluster shapes
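The sensitivity to outliers follows directly from using the mean as the cluster center. A small sketch with made-up points shows how a single distant observation drags the centroid away from the bulk of its cluster; the data and names are hypothetical.

```python
from math import dist

# Hypothetical tight cluster and one distant outlier forced into it.
cluster = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)]
outlier = (30.0, 30.0)


def centroid(points):
    """Coordinate-wise mean of the points."""
    return tuple(sum(c) / len(points) for c in zip(*points))


clean_center = centroid(cluster)
pulled_center = centroid(cluster + [outlier])
shift = dist(clean_center, pulled_center)

# Largest distance from the clean center to any genuine member, for comparison.
radius = max(dist(clean_center, p) for p in cluster)
print(clean_center, pulled_center, shift, radius)
```

Here the centroid moves much farther than the cluster's own radius, so the center no longer represents any of the genuine members well.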
When Do We Apply Clustering Algorithms?
 Segmentation – customers, products and stores
 Anomaly detection
   Outliers typically belong to clusters with one observation
   Identifying fraudulent transactions
 Preparatory step for other techniques
 Summarizing a document: cluster and use the centroids
 Predictive modelling on segments
 Logistic regression results can be improved by fitting on small clusters
 Missing value imputation
 Decrease dependence between attributes