Clustering is a Machine Learning approach that groups data points together. We can use a clustering method to classify each data point in a certain group given a set of data points. Today, we’ll look at various clustering methods that data scientists should be aware of, as well as their benefits and drawbacks!
Clustering Algorithms in Python Machine Learning
Let’s now explore the different clustering algorithms in Python that we can use for Machine learning!
1. K-Means Clustering Algorithm
The most well-known clustering algorithm is undoubtedly K-Means. It’s covered in many beginning data science and machine learning courses. It’s simple to grasp and implement in code! For an illustration, see the graphic below.
Steps Involved in K-Means Clustering
First, we choose a few classes/groups to use and randomly initialize their respective center points. To determine the number of classes to employ, take a brief look at the data and look for any identifiable groupings. The center points are vectors of the same length as each data point vector, and they are the “X’s” in the figure above.
Each data point is classified by calculating the distance between it and the center of each group and then identifying the point as belonging to the group whose center is closest to it.
We recompute the group center based on these classed points by taking the mean of all the vectors in the group.
Repeat these processes for a predetermined number of iterations or until the group centers do not change significantly between iterations. You can also choose to randomly initialize the group centers a few times before selecting the run that appears to have produced the best results.
Advantages and Disadvantages of KMeans Clustering
K-Means has the advantage of being relatively fast because all we’re doing is computing distances between points and group centers; very few computations! As a result, it has an O(n) linear complexity.
K-Means, on the other hand, has a few drawbacks. First, decide how many groups/classes there will be. This isn’t always easy, and ideally, we’d want a clustering algorithm to figure it out for us because the goal is to obtain insight from the data.
K-means also starts with a random selection of cluster centers; therefore, different clustering results may be obtained on different runs of the method. As a result, the findings may be unpredictable and inconsistent. Other clustering approaches are more reliable.
2. Mean-Shift Clustering Algorithm
Mean shift clustering is a sliding-window method that seeks out dense clusters of data points. It is a centroid-based technique, which means that the purpose is to find the center points of each group/class by updating candidates for center points to be the mean of the points within the sliding window.
In a post-processing stage, these candidate windows filter in such a way that it eliminates near-duplicates, yielding the final set of center points and their related groups.
Steps Involved in Mean Shift Clustering
To explain mean-shift, consider the graphic depiction of a set of points in two-dimensional space. We start with a circular sliding window centered at a point C (randomly chosen) like the kernel. Mean shift is a hill-climbing algorithm that includes moving this kernel to a higher density region iteratively on each step until convergence.
The sliding window adjusts towards higher density regions for each iteration by adjusting the center point to the mean of the points within the window. The density of the sliding window is proportional to the number of points contained within it.
Naturally, changing the mean of the points in the window will lead to a steady movement towards locations with higher point density.
We keep shifting the sliding window according to the mean until there are no more locations inside the kernel that a shift can accommodate. Examine the graph above; we continue to move the circle until we no longer increase the density that is the number of points in the window.
These steps work on repeat with many sliding windows until the window includes all the points. When multiple sliding windows overlap, the one with the most points is kept. The data points are clustered according to the sliding window in which they are located.
Advantages and Disadvantages of Mean Shift Clustering
In contrast to K-means clustering, there is no need to specify the number of clusters because mean-shift does so automatically. That’s a huge benefit.
The cluster centers converging towards the points of maximum density is also desirable because it is simple to understand and fits well in a naturally data-driven perspective. The disadvantage is that determining the window size/radius “r” can be difficult.
3. Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
DBSCAN is a density-based clustering method similar to mean-shift but has a few noticeable advantages.
Steps Involved in DBSCAN Clustering
DBSCAN starts with an arbitrary, previously unvisited beginning data point. This point’s neighborhood is determined using a distance epsilon (all points within the distance are neighborhood points).
If there are sufficient points (as determined by minPoints) in this neighborhood, the clustering process begins, and the current data point becomes the first point in the new cluster. If not, the point will be noise (later, this noisy point might become part of the cluster). That point is “visited” in both situations.
The points inside its distance neighborhood become members of the same cluster as this first point in the new cluster. This assigning all points in the neighborhood to the same cluster is then performed for any new points to the cluster group.
Steps 2 and 3 repeats until all points in the cluster are determined, i.e., all points within the cluster’s vicinity have labels.
When we finish with the current cluster, we retrieve and process a new unvisited point, which leads to identifying a new cluster or noise. This procedure repeats until the algorithm visits all the points. Because of this, the label of each point is either a cluster or noise.
Advantages and Disadvantages of DBSCAN Clustering
DBSCAN has several significant advantages over other clustering techniques. For starters, it does not necessitate a predetermined number of clusters. It also recognizes outliers as noise instead of mean-shift, which places them into a cluster regardless of how different the data point is. Furthermore, it is capable of locating arbitrary big and any formed clusters.
The fundamental disadvantage of DBSCAN is that it does not perform as well as others when the cluster density varies. This is because the distance threshold and minPoints for recognizing neighborhood points will differ from cluster to cluster as density varies.
This disadvantage also happens with very high-dimensional data since estimating the distance threshold becomes difficult.
4. Expectation – Maximization (EM) Clustering using Gaussian Mixture Models (GMM)
The naive use of the mean value for the cluster center is one of K-Means’ key shortcomings. Look at the image below, you can see why this isn’t the ideal method to go about things.
To the normal eye, there appear to be two circular clusters with differing radii’ centered at the same mean on the left. K-Means cannot handle this since the clusters’ mean values are so close together. K-Means also fail when the clusters are not circular, due to the use of the mean as the cluster center.
GMMs (Gaussian Mixture Models) provide more flexibility than K-Means. We assume that the data points are Gaussian distributed when using GMMs; this is a less restrictive assumption than claiming they are circular when using the mean.
As a result, we have two factors that define the geometry of the clusters: mean and standard deviation! Taking a two-dimensional example, this means that the clusters can have an elliptical shape (since we have a standard deviation in both the x and y directions). As a result, each Gaussian distribution is assigned to only one cluster.
Advantage and Disadvantages EM using GMM Clustering
There are two major benefits to using GMMs. For starters, GMMs are far more adaptable in terms of cluster covariance than K-Means; because to the standard deviation parameter, the clusters can take on any elliptical form rather than being limited to circles.
K-Means is a subset of GMM in which the covariance of each cluster along all dimensions approaches zero. Second, because GMMs use probabilities, each data point can have several clusters. So, if a data point falls in the middle of two overlapping clusters, we may describe its class as belonging X% to class 1 and Y% to class 2. GMMs, for example, allow for a diverse membership.
Clustering algorithms are an important aspect of data science and hence have a role in data mining. Any aspiring data scientist interested in a career in data science should be familiar with the clustering methods outlined above.
The topic of cluster algorithms is vast, and each person’s approach is unique. You should be aware that there is no one-size-fits-all answer. Each algorithm must be viewed as a separate tool. Every strategy does not function equally effectively in every case.