Density Based Clustering

Density-based clustering utilizes machine learning methods without supervision to track down distinct clusters within a dataset. In these methodologies, the concept is that within a dataset, a particular group or cluster is represented by an unbroken region with a high number of points, separated from other clusters by sparser regions. The data points based in these separations are typically viewed as noise or outliers. Clustering is a vital issue in data analytics, dynamic in multiple applications, which include the identification of problematic servers, gene categorization based on expression patterns, outlier detection in biological images, and several others. Commonly known data clustering algorithms include DBSCAN and k-Means, with K-Means assigning points to the closest centroid.

Use Cases

In urban water distribution networks, potential problems like pipe breakage could be signaled by clusters of burst incidents. Using density-based clustering methods, engineers can uncover these problematic spots and implement preemptive actions in critical areas of water supply systems. In basketball, using position data of successful and failed NBA shots, the methodology of density-based clustering can expose distinct patterns of successful and failed shot attempts per player, shaping the game's tactics.

Using a point dataset, representing each house in the research region, houses affected by pests and those that aren't can be distinguished. The density-based clustering can locate concentration spots of affected houses, aiding in the formulation of effective treatment and eradication strategies.

In disaster or security crisis management, geo-tracking tweets can help determine evacuation and rescue needs based on cluster size and position.

Methods

The density-based clustering tool has three modes to find the clusters in your point data:

Defined distance (DBSCAN), differentiating dense clusters and sparser noise, with a clear Search Distance applying to all possible clusters for effective operation.

Self-adjusting (HDBSCAN) uses diverse distances to separate different density clusters from noise with sparser coverage.

Multi-scale (OPTICS) employs the distance between closer features to construct a reachability plot, further used to differentiate clusters of varying densities from noise.

The Density-based clustering algorithm pinpoints areas where points are aggregated and those separated by empty or sparser regions. Points not affiliated with a cluster are termed as noise. The tool uses unsupervised machine learning clustering methodologies to automatically find patterns based on physical location and distance to a specific number of neighbors. These methodologies are seen as unsupervised since they don't require training on the interpretation of a cluster.

Use Cases

Methods

Detect hidden vulnerabilities in ML models, from tabular to LLMs, before moving to production.