Clustering — Synthetic Blobs

Synthetic blobs (300 samples, 4 features, 3 well-separated Gaussian clusters) make the full clustering workflow easy to follow. This tutorial trains a map, picks the number of clusters with the elbow and silhouette diagnostics, clusters the neurons, draws the cluster map, and compares K-Means, GMM, and HDBSCAN using quantitative metrics. The integer blob ids serve as known ground-truth classes for the classification map.

Note

Full runnable notebook: notebooks/clustering.ipynb. The figures below are its outputs.

1. Generate and standardize the data

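The best-matching-unit (BMU) search compares distances in raw feature space, so a feature with a larger numeric range would dominate the match. Standardize every feature to zero mean and unit variance before training.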

import torch
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
features = torch.tensor(
    StandardScaler().fit_transform(X), dtype=torch.float32
)
targets = torch.tensor(y, dtype=torch.long)   # 0, 1, 2
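To see why scaling matters, here is a minimal NumPy sketch. The `data`, `prototypes`, and `sample` arrays are made-up toy values, and `nearest` is a hypothetical helper, none of them part of the tutorial. It looks up the closest of two prototype vectors before and after standardization. NumPy's default population standard deviation matches what StandardScaler uses.

```python
import numpy as np

# Toy values (assumed for illustration): feature 0 spans 0..1, feature 1 spans 0..1000.
data = np.array([[0.0, 0.0], [1.0, 1000.0], [0.0, 1000.0], [1.0, 0.0]])
prototypes = np.array([
    [1.0, 500.0],   # opposite end of feature 0, identical feature 1
    [0.0, 550.0],   # identical feature 0, 5% of the range away in feature 1
])
sample = np.array([0.0, 500.0])


def nearest(vector, candidates):
    """Index of the candidate with the smallest Euclidean distance."""
    return int(np.linalg.norm(candidates - vector, axis=1).argmin())


mean, std = data.mean(axis=0), data.std(axis=0)  # population std (ddof=0)

raw_bmu = nearest(sample, prototypes)
scaled_bmu = nearest((sample - mean) / std, (prototypes - mean) / std)

print(raw_bmu, scaled_bmu)
```

On raw values the large-range feature decides the match, so the prototype at the opposite end of feature 0 wins. After standardization the prototype that agrees on feature 0 wins instead.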

2. Train the SOM

from torchsom import SOM

som = SOM(
    x=25,
    y=15,
    num_features=features.shape[1],
    epochs=100,
    batch_size=16,
    sigma=1.45,
    learning_rate=0.95,
    neighborhood_order=3,
    topology="rectangular",
    initialization_mode="pca",
    random_seed=42,
)
som.initialize_weights(data=features, mode=som.initialization_mode)
q_errors, t_errors = som.fit(data=features)

3. Check convergence

from torchsom import SOMVisualizer

viz = SOMVisualizer(som=som)
viz.plot_training_errors(
    quantization_errors=q_errors, topographic_errors=t_errors
)
[Figure: Blobs training curve]

Both the quantization error and the topographic error fall and then flatten, which indicates that 100 epochs is enough for the map to converge.
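As a rough guide to what the two curves measure, the sketch below computes both errors with NumPy under their usual textbook definitions. Quantization error is the mean distance from each sample to its BMU. Topographic error is the share of samples whose best and second-best units are not grid neighbors. The `som_errors` helper, the 8-neighbor adjacency rule, and the toy arrays are assumptions for illustration; torchsom's own implementation may differ in detail.

```python
import numpy as np


def som_errors(weights, data):
    """weights: (rows, cols, n_features); data: (n_samples, n_features)."""
    rows, cols, _ = weights.shape
    flat = weights.reshape(rows * cols, -1)
    dists = np.linalg.norm(data[:, None, :] - flat[None, :, :], axis=2)

    order = np.argsort(dists, axis=1)
    best, second = order[:, 0], order[:, 1]

    # Quantization error: mean distance between a sample and its BMU.
    q_error = dists[np.arange(len(data)), best].mean()

    # Topographic error: fraction of samples whose two closest units
    # are not adjacent on the grid (8-neighborhood assumed here).
    r1, c1 = np.divmod(best, cols)
    r2, c2 = np.divmod(second, cols)
    adjacent = np.maximum(np.abs(r1 - r2), np.abs(c1 - c2)) <= 1
    t_error = 1.0 - adjacent.mean()
    return float(q_error), float(t_error)


samples = np.array([[0.1], [1.9]])
ordered_map = np.array([[[0.0], [1.0], [2.0]]])   # 1x3 grid, weights in order
twisted_map = np.array([[[0.0], [2.0], [1.0]]])   # same weights, shuffled on the grid

q_ordered, t_ordered = som_errors(ordered_map, samples)
q_twisted, t_twisted = som_errors(twisted_map, samples)
print(q_ordered, t_ordered, q_twisted, t_twisted)
```

Both toy maps quantize the samples equally well, but only the twisted one has a nonzero topographic error. That is why the two curves are tracked separately.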

4. Map structure

The U-matrix exposes cluster boundaries; the hit map shows where the data lands.

viz.plot_distance_map(
    distance_metric=som.distance_fn_name,
    neighborhood_order=som.neighborhood_order,
)
viz.plot_hit_map(data=features)

The U-matrix shows three basins of low inter-neuron distance separated by clear ridges, matching the three Gaussian clusters in the data.
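For intuition about where the ridges come from, the following sketch builds a basic U-matrix with NumPy: each cell holds the mean Euclidean distance between a neuron's weight vector and those of its direct grid neighbors. The `u_matrix` helper, the 4-neighbor rule, and the toy weights are assumptions. plot_distance_map takes its own distance metric and neighborhood order, so its values need not match this simplified version.

```python
import numpy as np


def u_matrix(weights):
    """weights: (rows, cols, n_features) -> (rows, cols) mean neighbor distance."""
    rows, cols, _ = weights.shape
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            neighbors = [
                (r + dr, c + dc)
                for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1))
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            out[r, c] = np.mean(
                [np.linalg.norm(weights[r, c] - weights[nr, nc]) for nr, nc in neighbors]
            )
    return out


# Toy 1x4 map: two neurons near 0 and two near 5, so a ridge sits in the middle.
toy_weights = np.array([[[0.0], [0.1], [5.0], [5.1]]])
umat = u_matrix(toy_weights)
print(umat)
```

The two middle neurons border the jump between the groups and get high values (the ridge), while the outer neurons sit in low-distance basins.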

5. Ground-truth classes

Build the BMU→sample map once, then color each neuron by its dominant blob id.

bmus_map = som.build_map("bmus_data", data=features)
viz.plot_classification_map(
    bmus_data_map=bmus_map,
    data=features,
    target=targets,
    neighborhood_order=som.neighborhood_order,
)
[Figure: Blobs classification map]

The three blobs occupy distinct, contiguous regions of the grid, confirming the map has preserved the cluster structure.
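The coloring rule described above is a per-neuron majority vote, which can be sketched with the standard library alone. The sketch assumes the BMU map is a dictionary from grid position to a list of sample indices; that structure, the `dominant_labels` helper, and the toy values are assumptions for illustration, not a description of torchsom's internals.

```python
from collections import Counter


def dominant_labels(bmus_map, targets):
    """Return {neuron: most frequent label among the samples it wins}."""
    return {
        neuron: Counter(targets[i] for i in sample_ids).most_common(1)[0][0]
        for neuron, sample_ids in bmus_map.items()
        if sample_ids  # neurons that win no samples stay unlabeled
    }


# Toy map: neuron (0, 0) wins samples 0-2, neuron (0, 1) wins sample 3,
# and neuron (1, 1) wins nothing.
toy_map = {(0, 0): [0, 1, 2], (0, 1): [3], (1, 1): []}
toy_targets = [0, 0, 1, 2]

labels_by_neuron = dominant_labels(toy_map, toy_targets)
print(labels_by_neuron)  # {(0, 0): 0, (0, 1): 2}
```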

6. Choose the number of clusters

The elbow plot tracks within-cluster dispersion against k. The bend, where adding another cluster stops reducing dispersion by much, marks a good choice.

viz.plot_elbow_analysis(max_k=10, feature_space="weights")
[Figure: Blobs elbow analysis]

The curve bends sharply at k=3, agreeing with the three basins seen in the U-matrix.
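The bend can also be located numerically. One simple heuristic, used here as an assumption rather than as a description of what plot_elbow_analysis does, is to pick the k with the largest second difference of the dispersion curve. The sketch runs a small deterministic k-means in NumPy on hand-built toy points. `farthest_point_init`, `kmeans_inertia`, and the data are all hypothetical.

```python
import numpy as np


def farthest_point_init(points, k):
    """Deterministic seeding: start at point 0, then repeatedly add the
    point farthest from all centers chosen so far."""
    chosen = [0]
    while len(chosen) < k:
        d2 = ((points[:, None, :] - points[chosen][None, :, :]) ** 2).sum(axis=2)
        chosen.append(int(d2.min(axis=1).argmax()))
    return points[chosen].copy()


def kmeans_inertia(points, k, n_iter=20):
    """Within-cluster sum of squared distances after Lloyd iterations."""
    centers = farthest_point_init(points, k)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # leave a center in place if its cluster is empty
                centers[j] = points[labels == j].mean(axis=0)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return float(d2.min(axis=1).sum())


# Three tight toy groups of four points each.
offsets = np.array([[-0.5, -0.5], [0.5, -0.5], [-0.5, 0.5], [0.5, 0.5]])
group_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.concatenate([c + offsets for c in group_centers])

ks = list(range(1, 6))
inertias = [kmeans_inertia(points, k) for k in ks]

# Second difference of the curve; defined only for interior values of k.
curvature = [
    inertias[i - 1] - 2 * inertias[i] + inertias[i + 1]
    for i in range(1, len(ks) - 1)
]
elbow_k = ks[1 + int(np.argmax(curvature))]
print(elbow_k)
```

On noisier curves the second difference can be unstable, so the plot should still be checked by eye.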

7. Cluster the neurons

With k=3 chosen, cluster the codebook vectors and draw the result. cluster() returns a dictionary; pass it straight to the visualizer. The silhouette plot reports how cleanly each neuron sits in its cluster.

result = som.cluster(method="kmeans", n_clusters=3, feature_space="weights")
viz.plot_cluster_map(cluster_result=result)
viz.plot_silhouette_analysis(cluster_result=result)

The cluster map’s boundaries follow the U-matrix ridges, so the three neuron groups line up with the data’s natural separation. feature_space can be "weights", "positions", or "combined" — see Clustering for the trade-offs.
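As a reminder of what a silhouette value expresses, the sketch below computes it directly with NumPy. For each point, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the score is (b - a) / max(a, b). The `silhouette_values` helper and the toy points are assumptions for illustration; the visualizer's own computation is not shown here.

```python
import numpy as np


def silhouette_values(points, labels):
    """Per-point silhouette coefficient in [-1, 1]."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False  # exclude the point itself
        if not same.any():
            continue  # singleton cluster: score stays 0 by convention
        a = dists[i, same].mean()
        b = min(
            dists[i, labels == other].mean()
            for other in cluster_ids
            if other != labels[i]
        )
        scores[i] = (b - a) / max(a, b)
    return scores


toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_values(toy_points, np.array([0, 0, 1, 1]))  # natural grouping
bad = silhouette_values(toy_points, np.array([0, 1, 0, 1]))   # groups split apart

print(good.round(3), bad.round(3))
```

Values close to 1 mean a point sits well inside its cluster. Negative values, as in the second labeling, indicate points that would fit another cluster better.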

8. Compare algorithms

Rather than picking by eye, score K-Means, GMM, and HDBSCAN side by side. HDBSCAN ignores n_clusters and determines the number of clusters itself.

results = [
    som.cluster(method=m, feature_space="weights")
    for m in ("kmeans", "gmm", "hdbscan")
]
viz.plot_cluster_quality_comparison(results_list=results)
[Figure: Blobs clustering metrics comparison]

The panel scores each method with silhouette, Davies–Bouldin, and Calinski–Harabasz, so the final choice is driven by metrics rather than appearance.
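When reading the panel, it helps to know which direction is better for each score. The sketch below implements the standard Calinski–Harabasz and Davies–Bouldin definitions in NumPy on toy points. Both helper functions and the data are assumptions for illustration, not torchsom code.

```python
import numpy as np


def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion; higher is better."""
    ids = np.unique(labels)
    overall = points.mean(axis=0)
    between = within = 0.0
    for k in ids:
        members = points[labels == k]
        centroid = members.mean(axis=0)
        between += len(members) * np.sum((centroid - overall) ** 2)
        within += np.sum((members - centroid) ** 2)
    return float((between / (len(ids) - 1)) / (within / (len(points) - len(ids))))


def davies_bouldin(points, labels):
    """Mean worst-case ratio of cluster spread to centroid separation; lower is better."""
    ids = np.unique(labels)
    centroids = np.array([points[labels == k].mean(axis=0) for k in ids])
    spread = np.array([
        np.linalg.norm(points[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centroids)
    ])
    worst = []
    for i in range(len(ids)):
        ratios = [
            (spread[i] + spread[j]) / np.linalg.norm(centroids[i] - centroids[j])
            for j in range(len(ids))
            if j != i
        ]
        worst.append(max(ratios))
    return float(np.mean(worst))


toy_points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
toy_labels = np.array([0, 0, 1, 1])

ch_score = calinski_harabasz(toy_points, toy_labels)
db_score = davies_bouldin(toy_points, toy_labels)
print(ch_score, db_score)
```

Silhouette and Calinski–Harabasz reward higher values, while Davies–Bouldin rewards lower ones, so a method that leads on all three has the largest first two scores and the smallest third.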

Next steps