Clustering is an unsupervised machine learning task that involves grouping data. It detects previously unseen patterns in the data and uses them to form clusters, placing data points that exhibit similar patterns into the same cluster. One of clustering's primary benefits is its ability to make sense of unlabeled data.
Unlabeled vs Labeled Data
Unlabeled data is abundant and relatively easy to collect. It might take the form of a collection of images scraped from the internet, a set of social media posts, or any other un-annotated collection of data points. Labeled data, by contrast, comes with labels attached, which makes it considerably more useful; however, obtaining it is labor-intensive, typically requiring human annotators to tag each data point by hand.
The type of data – labeled or unlabeled – determines which learning algorithms can be applied. Machine learning algorithms are accordingly divided into supervised and unsupervised learning. Labeled data enables supervised learning: the labels act as a training signal, and the model learns a mapping from data points to labels. Unsupervised learning has no such signal; instead, it relies on the statistical structure of the data to discover groupings, effectively assigning its own labels.
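To make the distinction concrete, the sketch below fits a supervised classifier on labeled data and an unsupervised clusterer on the same points without labels. The library (scikit-learn) and the synthetic two-blob dataset are illustrative choices, not prescribed by the discussion above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: two well-separated Gaussian blobs (invented for illustration).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)  # labels, available only in the supervised case

# Supervised: labels guide learning; the model maps data points to labels.
clf = LogisticRegression().fit(X, y)

# Unsupervised: no labels; the algorithm infers group structure on its own.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Note that the cluster indices returned by k-means are arbitrary: cluster 0 need not correspond to label 0, which is exactly the "no cues from labels" point made above.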
Clustering Algorithm Types
The landscape of clustering algorithms is broad, and there is no consensus on how to categorize them, so different sources adopt different criteria. From a practical standpoint, two classifications are useful:
i) based on the number of clusters to which a data point may belong;
ii) based on the shapes of the clusters an algorithm produces.
The first distinction is between "hard" and "soft" clustering. Hard clustering restricts each data point to membership in a single cluster, whereas soft clustering allows a point to belong to multiple clusters with varying degrees of association – an important consideration when deciding whether clusters should be rigid or allowed to overlap.
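The contrast can be seen directly in code. In the sketch below (scikit-learn used as an illustrative library, synthetic blobs invented for the example), k-means assigns each point exactly one label, while a Gaussian mixture model returns a membership probability per cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Toy 2-D data: two loose blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(3, 0.5, (50, 2))])

# Hard clustering: each point receives exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point receives a membership probability per cluster.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)  # shape (100, 2); each row sums to 1
```

A point near a cluster center gets a membership vector close to [1, 0] or [0, 1]; a point between clusters gets intermediate probabilities, which is precisely what hard clustering cannot express.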
The second classification concerns the shape and kind of clusters an algorithm yields. Hierarchical, centroid-based, and density-based models are among the most prominent cluster types under this classification.
Factors such as the size of the dataset can significantly affect an algorithm's performance. Hierarchical clustering methods struggle with large datasets: naive agglomerative implementations run in cubic time. In such cases, switching from a hierarchical model to a centroid-based technique such as k-means can be a better choice, given its much lower runtime.
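The runtime gap is easy to observe empirically. The sketch below times both approaches on a modest synthetic dataset; the dataset size, dimensionality, and cluster count are arbitrary choices for illustration, and the gap widens quickly as the number of points grows:

```python
import time
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Arbitrary synthetic data: 2,000 points in 8 dimensions (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))

# Hierarchical (agglomerative) clustering: pairwise-distance based, scales poorly.
t0 = time.perf_counter()
hier_labels = AgglomerativeClustering(n_clusters=5).fit_predict(X)
hier_seconds = time.perf_counter() - t0

# Centroid-based k-means: iterative refinement, far cheaper per data point.
t0 = time.perf_counter()
kmeans_labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
kmeans_seconds = time.perf_counter() - t0

print(f"hierarchical: {hier_seconds:.3f}s, k-means: {kmeans_seconds:.3f}s")
```

At a few thousand points both finish quickly; try scaling the dataset up tenfold and the hierarchical run becomes the clear bottleneck, while k-means remains practical.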
Also worth considering is the nature of the data at hand. Many clustering algorithms form clusters using a distance metric, which requires the dataset's features to be numeric. Categorical variables can be one-hot encoded into binary values, but Euclidean distances between such encodings are largely meaningless. For such cases, k-modes clustering, designed for categorical data, is a viable option (its extension, k-prototypes, handles mixed numeric and categorical data), or a different approach may be warranted entirely.
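The core idea behind k-modes can be sketched in a few lines: replace k-means' Euclidean distance with a mismatch count (Hamming-style dissimilarity) and replace centroid means with per-column modes. The function below is a simplified illustration, not a full implementation – real libraries use smarter initialization (e.g. Huang's method), whereas this sketch naively takes the first k rows as initial modes – and the category codes in the example data are invented:

```python
import numpy as np

def kmodes_sketch(X, k, n_iter=10):
    """Simplified k-modes: mismatch-count distance, per-column mode updates.

    X is a 2-D integer array of categorical codes.
    """
    modes = X[:k].copy()  # naive init: first k rows (real implementations sample)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign each row to the mode it mismatches in the fewest columns.
        dists = (X[:, None, :] != modes[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update each mode to the most frequent category in each column.
        for j in range(k):
            members = X[labels == j]
            if len(members):
                modes[j] = [np.bincount(col).argmax() for col in members.T]
    return labels, modes

# Invented categorical data: columns might encode (color, size, shape) as codes.
X = np.array([[0, 0, 1], [2, 2, 0], [0, 0, 1],
              [0, 1, 1], [2, 2, 0], [2, 1, 0]])
labels, modes = kmodes_sketch(X, k=2)
```

Because the "distance" here only counts mismatched categories, it stays meaningful for categorical features where Euclidean distance on one-hot encodings would not be.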
Practicing clustering in data science involves a measure of trial and error. The output of a clustering algorithm might not make sense at first. Assess whether the generated clusters are meaningful before deciding whether to try another method. Because clustering is one of the oldest and most heavily researched machine learning methods, there are many algorithms to choose from. To cut down on trial and error, it helps to understand the strengths of the various algorithms: the kinds of tasks and data they are best suited for, and the types of clusters they produce.
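One common way to put a number on "do these clusters make sense?" is the silhouette score, which compares each point's distance to its own cluster against its distance to the nearest other cluster. The sketch below (synthetic data and scikit-learn chosen for illustration) computes it for a k-means result:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: two well-separated blobs (invented for the example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (40, 2)),
               rng.normal(3, 0.5, (40, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = silhouette_score(X, labels)  # in [-1, 1]; higher means better separation
```

A score near 1 suggests compact, well-separated clusters; a score near 0 or below suggests the clustering is not capturing real structure – a cue to try a different algorithm or a different number of clusters.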