Clustering is a method that consists of grouping data points (clients, texts, images…) based on their similarities. It is an unsupervised machine learning problem that aims to process data and find similar structures in a set of data without any target values (a dataset without labels).
Clusters are groups of similar elements that differ from the elements in other clusters.
Clustering benefits are many and varied depending on the field:
- Client clustering: optimize and adapt the strategy based on client behavior
- Increase company productivity: deal with groups of clients rather than individual clients (reduced workload)
Clustering Types
- Hierarchical Clustering (e.g., CAH)
- Centroid-based Clustering (e.g., K-means)
- Density-based Clustering (e.g., DBSCAN)
- Distribution-based Clustering (e.g., DBCLASD)
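Each of these families is available off the shelf in scikit-learn. Below is a minimal sketch pairing each family with one estimator; GaussianMixture stands in for the distribution-based family since DBCLASD is not implemented in scikit-learn, and the toy data and parameter values are only illustrative assumptions.

# A minimal sketch pairing each clustering family with a scikit-learn estimator.
# GaussianMixture stands in for the distribution-based family (DBCLASD is not
# implemented in scikit-learn); the toy data and parameters are illustrative.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture

X = np.random.rand(300, 2)  # toy 2D data

models = {
    "Hierarchical (CAH / agglomerative)": AgglomerativeClustering(n_clusters=4),
    "Centroid-based (K-means)": KMeans(n_clusters=4, n_init=10, random_state=0),
    "Density-based (DBSCAN)": DBSCAN(eps=0.1, min_samples=5),
    "Distribution-based (Gaussian mixture)": GaussianMixture(n_components=4, random_state=0),
}

for name, model in models.items():
    labels = model.fit_predict(X)  # all four estimators expose fit_predict
    # For DBSCAN, the label -1 marks noise points rather than a cluster
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"{name}: {n_clusters} clusters found")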
Clustering Workflow
In this article, we will cover different clustering algorithms: K-means, CAH, OPTICS, DBSCAN…
Basically, whichever algorithm you select, the final result should be the same if it is well executed.
K-means:
K-means is an iterative algorithm. To run the clustering, you must first define the number of clusters k that you want. Given k, we start by randomly selecting k centroids (centers). The next step is to assign each data point to the nearest centroid; we then compute the barycentre of each cluster and define that barycentre as the new centroid.
We iterate these steps until no changes happen and the centroids are the same at step n and step n-1.
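To make this update rule concrete, here is a minimal from-scratch sketch of the loop in NumPy (the toy data and the value of k are assumptions for illustration): points are assigned to their nearest centroid, each cluster's barycentre becomes the new centroid, and the loop stops once the centroids stop moving between two consecutive steps.

import numpy as np

def kmeans_from_scratch(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Randomly pick k data points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # The barycentre of each cluster becomes its new centroid
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop when the centroids at step n equal those at step n-1
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy usage on random 2D points, assuming k = 3
X = np.random.default_rng(1).normal(size=(150, 2))
labels, centroids = kmeans_from_scratch(X, k=3)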
# Import libraries
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
# Function to generate data representing 4 clusters
def generate_data(mean_, variance, n_points):
    n = n_points // 4
    X = np.random.normal(loc=mean_, scale=variance, size=(n_points, 2))
    X1 =…
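Putting the pieces together, here is a self-contained sketch of what such a data-generation step followed by a K-means fit and plot could look like; the function name generate_four_clusters, the cluster offsets and all parameter values are assumptions for illustration.

from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np

def generate_four_clusters(mean_, variance, n_points):
    # Four Gaussian blobs of n_points // 4 points each, shifted by
    # arbitrary offsets around mean_ (the offsets are assumptions)
    n = n_points // 4
    offsets = np.array([[0, 0], [5, 0], [0, 5], [5, 5]])
    blobs = [np.random.normal(loc=mean_ + off, scale=variance, size=(n, 2))
             for off in offsets]
    return np.vstack(blobs)

X = generate_four_clusters(mean_=0, variance=1.0, n_points=400)

# Fit K-means with k = 4, matching the number of generated clusters
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Plot the points coloured by cluster, with the final centroids marked
plt.scatter(X[:, 0], X[:, 1], c=kmeans.labels_, s=10, cmap="viridis")
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c="red", marker="x", s=100)
plt.title("K-means on 4 generated clusters")
plt.show()

With k set to 4, the fitted centroids should land close to the centres of the four generated blobs.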