Clustering — Synthetic Blobs
Synthetic blobs (300 samples, 4 features, 3 well-separated Gaussian clusters) make the full clustering workflow easy to follow. This tutorial trains a map, picks the number of clusters with the elbow and silhouette diagnostics, clusters the neurons, draws the cluster map, and compares K-Means, GMM, and HDBSCAN on quantitative quality metrics. The integer blob ids serve as the known ground-truth classes for the classification map.
Note
Full runnable notebook: notebooks/clustering.ipynb. The figures below are its outputs.
1. Generate and standardize the data
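The BMU (best matching unit) search compares distances between raw feature vectors, so a feature measured on a larger scale would dominate the result. Standardize every feature before training.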
import torch
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, y = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
features = torch.tensor(
    StandardScaler().fit_transform(X), dtype=torch.float32
)
targets = torch.tensor(y, dtype=torch.long)  # 0, 1, 2
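To make the scaling argument concrete, here is a minimal sketch on made-up two-feature data (hypothetical values, not the blobs above). It relies on the fact that the mean squared Euclidean distance between two random samples equals twice the sum of the per-feature variances, so each feature's share of the variance is also its share of the distance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 0 spans thousands, feature 1 spans units.
rng = np.random.default_rng(0)
raw = np.column_stack([
    rng.normal(loc=5000.0, scale=1000.0, size=200),
    rng.normal(loc=0.0, scale=1.0, size=200),
])
scaled = StandardScaler().fit_transform(raw)


def feature_share(data):
    """Share of the mean squared pairwise distance contributed by each feature."""
    variances = data.var(axis=0)
    return variances / variances.sum()


share_raw = feature_share(raw)
share_scaled = feature_share(scaled)
print("raw:   ", share_raw.round(4))
print("scaled:", share_scaled.round(4))
```

Before scaling, feature 0 accounts for almost all of the distance; after scaling, both features contribute equally.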
2. Train the SOM
from torchsom import SOM
som = SOM(
    x=25,
    y=15,
    num_features=features.shape[1],
    epochs=100,
    batch_size=16,
    sigma=1.45,
    learning_rate=0.95,
    neighborhood_order=3,
    topology="rectangular",
    initialization_mode="pca",
    random_seed=42,
)
som.initialize_weights(data=features, mode=som.initialization_mode)
q_errors, t_errors = som.fit(data=features)
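Training revolves around the BMU search: each sample is assigned to the neuron whose weight vector is closest. The following is a minimal NumPy sketch of that idea on a toy 2x2 codebook. The function name `find_bmus` and the Euclidean metric are assumptions for illustration; this is not torchsom's implementation.

```python
import numpy as np


def find_bmus(samples, weights):
    """Return the (row, col) grid index of the closest neuron for each sample.

    samples: array of shape (n_samples, n_features)
    weights: codebook of shape (rows, cols, n_features)
    """
    rows, cols, n_features = weights.shape
    flat = weights.reshape(rows * cols, n_features)
    # Pairwise Euclidean distances, shape (n_samples, rows * cols).
    dists = np.linalg.norm(samples[:, None, :] - flat[None, :, :], axis=2)
    winners = dists.argmin(axis=1)
    # Convert flat neuron indices back to grid coordinates.
    return np.column_stack(np.unravel_index(winners, (rows, cols)))


# Toy 2x2 codebook in a 2-feature space (hypothetical values).
weights = np.array([
    [[0.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [1.0, 1.0]],
])
samples = np.array([[0.1, 0.2], [0.9, 0.8], [0.8, 0.1]])
bmus = find_bmus(samples, weights)
print(bmus)
```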
3. Check convergence
from torchsom import SOMVisualizer
viz = SOMVisualizer(som=som)
viz.plot_training_errors(
    quantization_errors=q_errors, topographic_errors=t_errors
)
Both errors fall and then flatten, so 100 epochs are enough for this map to converge.
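For readers unfamiliar with the two curves, the sketch below computes both quantities by hand on a toy map. It assumes the common definitions: quantization error is the mean distance from a sample to its BMU, and topographic error is the share of samples whose best and second-best units are not grid neighbors. The helper `som_errors` and the 8-neighborhood adjacency rule are illustrative assumptions; torchsom may define adjacency differently.

```python
import numpy as np


def som_errors(samples, weights):
    """Quantization and topographic error for a rectangular grid.

    Assumption: two neurons are adjacent when their grid coordinates
    differ by at most 1 in each direction (8-neighborhood).
    """
    rows, cols, n_features = weights.shape
    flat = weights.reshape(rows * cols, n_features)
    dists = np.linalg.norm(samples[:, None, :] - flat[None, :, :], axis=2)

    # Best and second-best matching unit per sample.
    order = np.argsort(dists, axis=1)
    first, second = order[:, 0], order[:, 1]

    # Quantization error: mean distance from a sample to its BMU.
    q_error = dists[np.arange(len(samples)), first].mean()

    # Topographic error: share of samples whose two best units are not adjacent.
    r1, c1 = np.unravel_index(first, (rows, cols))
    r2, c2 = np.unravel_index(second, (rows, cols))
    adjacent = (np.abs(r1 - r2) <= 1) & (np.abs(c1 - c2) <= 1)
    t_error = 1.0 - adjacent.mean()
    return q_error, t_error


# Toy 1x3 map whose neurons lie in order on a line at 0, 1, 2.
weights = np.array([[[0.0], [1.0], [2.0]]])
samples = np.array([[0.1], [1.2], [1.9]])
q_error, t_error = som_errors(samples, weights)
print(q_error, t_error)
```

On this ordered map every sample's two best units sit next to each other, so the topographic error is zero; shuffling the neuron order along the grid would raise it while leaving the quantization error unchanged.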
4. Map structure
The U-matrix exposes cluster boundaries; the hit map shows where the data lands.
viz.plot_distance_map(
    distance_metric=som.distance_fn_name,
    neighborhood_order=som.neighborhood_order,
)
viz.plot_hit_map(data=features)
The U-matrix shows three basins of low inter-neuron distance separated by clear ridges, matching the three Gaussian clusters in the data.
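The basin-and-ridge picture follows directly from how a U-matrix is built: each cell holds the average distance between a neuron's weights and those of its grid neighbors. A minimal NumPy sketch, assuming a rectangular grid and 4-connected neighbors (the helper `u_matrix` is hypothetical, and torchsom's neighborhood_order option is not modeled here):

```python
import numpy as np


def u_matrix(weights):
    """Mean Euclidean distance from each neuron to its 4-connected neighbors."""
    rows, cols, _ = weights.shape
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))

    # Distances between vertically adjacent neurons, shape (rows - 1, cols).
    vert = np.linalg.norm(weights[1:] - weights[:-1], axis=2)
    total[1:] += vert
    total[:-1] += vert
    count[1:] += 1
    count[:-1] += 1

    # Distances between horizontally adjacent neurons, shape (rows, cols - 1).
    horiz = np.linalg.norm(weights[:, 1:] - weights[:, :-1], axis=2)
    total[:, 1:] += horiz
    total[:, :-1] += horiz
    count[:, 1:] += 1
    count[:, :-1] += 1

    return total / count


# Toy 4x6 codebook with one feature: left half sits at 0, right half at 5.
weights = np.zeros((4, 6, 1))
weights[:, 3:, 0] = 5.0
umat = u_matrix(weights)
print(umat.round(2))
```

The two flat halves produce zeros (basins), while the columns on either side of the jump produce a band of high values (a ridge).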
5. Ground-truth classes
Build the BMU→sample map once, then color each neuron by its dominant blob id.
bmus_map = som.build_map("bmus_data", data=features)
viz.plot_classification_map(
    bmus_data_map=bmus_map,
    data=features,
    target=targets,
    neighborhood_order=som.neighborhood_order,
)
The three blobs occupy distinct, contiguous regions of the grid, confirming the map has preserved the cluster structure.
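Coloring a neuron by its dominant blob id amounts to a majority vote over the samples that map to it. A small standard-library sketch of that vote, using hypothetical BMU coordinates and labels rather than the output of build_map:

```python
from collections import Counter, defaultdict


def dominant_labels(bmus, labels):
    """Map each BMU grid position to the most frequent label among its samples."""
    votes = defaultdict(Counter)
    for position, label in zip(bmus, labels):
        votes[position][label] += 1
    return {pos: counts.most_common(1)[0][0] for pos, counts in votes.items()}


# Hypothetical BMU positions (row, col) for six samples and their blob ids.
bmus = [(0, 0), (0, 0), (0, 0), (2, 3), (2, 3), (4, 1)]
labels = [0, 0, 1, 2, 2, 1]
neuron_class = dominant_labels(bmus, labels)
print(neuron_class)
```

Neurons that no sample maps to receive no entry, which is why a classification map can contain blank cells.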
6. Choose the number of clusters
The elbow plot tracks within-cluster dispersion against k; the bend marks a good
choice.
viz.plot_elbow_analysis(max_k=10, feature_space="weights")
The curve bends sharply at k=3, agreeing with the three basins seen in the U-matrix.
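The numbers behind an elbow plot can be reproduced with scikit-learn alone. The sketch below is an approximation under one stated assumption: it clusters the standardized blobs directly as a stand-in for the SOM codebook, and it uses K-Means inertia as the dispersion measure, which may differ from what plot_elbow_analysis computes internally.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Stand-in for the codebook vectors: the standardized blobs themselves.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)

ks = list(range(1, 8))
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
    for k in ks
]

# The elbow is where adding a cluster stops paying off: compare successive drops.
drops = -np.diff(inertias)
for k, drop in zip(ks[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

The drop from k=2 to k=3 is large and the drop from k=3 to k=4 is small, which is the bend the plot makes visible.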
7. Cluster the neurons
With k=3 chosen, cluster the codebook vectors and draw the result.
cluster() returns a dictionary; pass it straight to the
visualizer. The silhouette plot reports how cleanly each neuron sits in its cluster.
result = som.cluster(method="kmeans", n_clusters=3, feature_space="weights")
viz.plot_cluster_map(cluster_result=result)
viz.plot_silhouette_analysis(cluster_result=result)
The cluster map’s boundaries follow the U-matrix ridges, so the three neuron groups
line up with the data’s natural separation. feature_space can be "weights",
"positions", or "combined" — see Clustering for the
trade-offs.
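The silhouette value of a point compares its mean distance to members of its own cluster with its mean distance to the nearest other cluster; values near 1 indicate a clean assignment. A minimal scikit-learn sketch of the per-cluster summary, again using the standardized blobs as a stand-in for the codebook vectors (an assumption, since the trained weights are not reproduced here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples
from sklearn.preprocessing import StandardScaler

# Stand-in for the codebook vectors: the standardized blobs themselves.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)

labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
scores = silhouette_samples(X, labels)  # one value in [-1, 1] per point

for cluster_id in np.unique(labels):
    member_scores = scores[labels == cluster_id]
    print(
        f"cluster {cluster_id}: n={member_scores.size}, "
        f"mean silhouette={member_scores.mean():.3f}"
    )
```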
8. Compare algorithms
Rather than picking by eye, score K-Means, GMM, and HDBSCAN side by side. n_clusters
is ignored by HDBSCAN, which finds k itself.
results = [
    som.cluster(method=m, feature_space="weights")
    for m in ("kmeans", "gmm", "hdbscan")
]
viz.plot_cluster_quality_comparison(results_list=results)
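The panel scores each method with the silhouette coefficient, the Davies–Bouldin index, and the Calinski–Harabasz index, so the final choice is driven by metrics rather than appearance. Higher silhouette and Calinski–Harabasz values indicate better-separated clusters; for Davies–Bouldin, lower is better.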
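The same three metrics are available in scikit-learn, which makes it easy to sanity-check a comparison outside the visualizer. The sketch below is illustrative only: it scores K-Means and a Gaussian mixture on the standardized blobs (a stand-in for the codebook), and it leaves out HDBSCAN to keep the dependencies minimal.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Stand-in for the codebook vectors: the standardized blobs themselves.
X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=42)
X = StandardScaler().fit_transform(X)

labelings = {
    "kmeans": KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X),
    "gmm": GaussianMixture(n_components=3, random_state=42).fit_predict(X),
}

scores = {
    name: {
        "silhouette": silhouette_score(X, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
        "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    }
    for name, labels in labelings.items()
}

for name, row in scores.items():
    print(name, {metric: round(value, 3) for metric, value in row.items()})
```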
Next steps
Clustering — The clustering API and feature spaces in full
Visualization Gallery — Every plot explained
Iris — Classification — A classification tutorial on the Iris dataset