3.3 Clustering: Unsupervised Classification#

Clustering is a form of unsupervised classification: the goal is to discover structure in the data on the basis of its features, without using labels. In essence, clustering tries to identify coherent subgroups among the observations.

The homogeneity of a group is quantified through the concept of ‘distance’ between data points. This metric plays a pivotal role in the clustering process, as it determines how similar or dissimilar data points are considered to be.

This tutorial does not cover all possible clustering methods. No single clustering method works best in all scenarios; the best choice depends strongly on the inherent data structure. A simple summary, with toy examples of 2D data structures, is available in the sklearn package.

This lecture focuses on fundamental concepts that are relevant to the geosciences: 1) the definition of distance, used to illustrate the most basic concepts with the most popular clustering algorithms, 2) K-means clustering, and 3) agglomerative clustering.

1. Distance#

Understanding the concept of distance is crucial in machine learning, especially when it comes to clustering in the geosciences. Distance serves as a fundamental measure of dissimilarity or similarity between data points. Distance is also later used for calculating the loss and cost when training deep learning models.

There are various ways to estimate and quantify distance between data points. Some of the most commonly used distance metrics include:

  • Euclidean Distance: This is the straight-line distance between two data points in a multidimensional space. It is often used when the data features have similar units or scales. \(d(\mathbf{x},\mathbf{y}) = \sqrt{ \sum_{i=1}^N (x_i-y_i)^2 }\)

  • Manhattan Distance: Also known as the ‘L1’ distance, it measures the sum of the absolute differences between corresponding elements of two data points. It’s suitable when movement along axes is restricted, such as in grid-based data. \(d(\mathbf{x},\mathbf{y}) = \sum_{i=1}^N |x_i-y_i|\)

  • Geodesic Distance: Geodesic distance measures the shortest path between two points on the surface of a sphere, which is crucial for studying Earth’s geography, including GPS navigation and geodetic measurements.

  • Correlation-Based Distances: In geophysical and geospatial data analysis, correlation-based distance metrics like Pearson correlation or Spearman rank correlation are often used to assess relationships between variables.

  • Covariance-Based Distances: These distances, which take into account the spatial covariance or variogram models, are prevalent in geostatistics and spatial analysis.

  • Cosine Similarity: This metric calculates the cosine of the angle between two data vectors, providing a measure of their similarity, particularly in high-dimensional spaces. It’s frequently used for text or image data.

Scikit-learn implements the most commonly used metrics in the context of classic ML in the class metrics.DistanceMetric. More details are in this scikit-learn documentation.
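As a quick, self-contained illustration (a toy sketch, not part of the tutorial data), several of these metrics can be computed pairwise with scipy.spatial.distance.cdist:

import numpy as np
from scipy.spatial.distance import cdist

# two toy sets of 2-D points
x = np.array([[0.0, 1.0], [2.0, 5.0]])
y = np.array([[3.0, 4.0]])

print(cdist(x, y, metric='euclidean'))    # straight-line (L2) distance
print(cdist(x, y, metric='cityblock'))    # Manhattan (L1) distance
print(cdist(x, y, metric='cosine'))       # 1 - cosine similarity
print(cdist(x, y, metric='correlation'))  # 1 - Pearson correlation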

Relation to PCA: Clustering and PCA both simplify the data via a small number of summaries. But the differences are:

  • PCA seeks to reduce the dimensionality of the data, to find a low-dimensional representation of the data that explains a good fraction of the data variance,

  • Clustering seeks to find homogeneous groups within the observations.

In fact, it is common to combine both for complex, high-dimensional data: 1) PCA, then 2) clustering on the PCs.

There are two main clustering methods covered here: k-means clustering and hierarchical clustering.

The toolbox scikit-learn has a collection of clustering algorithms and detailed documentation with tutorials.

2. Tutorial set up#

Import useful Python packages

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import wget
# from math import cos, sin, pi, sqrt
# from mpl_toolkits.mplot3d import Axes3D
from sklearn import preprocessing
from sklearn.decomposition import PCA
import plotly.express as px
import os

Data#

We will use flow cytometer data.

The data file was provided by Katherine Qi from the UW SeaFlow group. The data consist of abundance and optical properties of phytoplankton. The file has attributes for each particle, such as normalized scatter, red, orange, and green fluorescence; these are measurements made directly by the instrument from light scattering and fluorescence. Other parameters, like diam (diameter) and Qc (carbon quota), are estimated from the light scatter measurements.

# download data
cc=wget.download("https://www.dropbox.com/s/dwa82x6xhjkhyw8/ug3_FCM_distribution.feather?dl=1")

read the data

# this is some underway data collected from a cruise in 2019
os.replace("ug3_FCM_distribution.feather",'../../../ug3_FCM_distribution.feather')
underway_g3 = pd.read_feather("../../../ug3_FCM_distribution.feather")
print(underway_g3.columns)
underway_g3.head()
Index(['filename', 'pop', 'norm.scatter', 'norm.red', 'norm.orange',
       'norm.green', 'diam_lwr', 'Qc_lwr', 'diam_mid', 'Qc_mid', 'diam_upr',
       'Qc_upr', 'date', 'file', 'label', 'lat', 'lon', 'depth', 'replicate',
       'volume', 'stain', 'flag', 'comments'],
      dtype='object')
filename pop norm.scatter norm.red norm.orange norm.green diam_lwr Qc_lwr diam_mid Qc_mid ... file label lat lon depth replicate volume stain flag comments
0 towfish_001.fcs prochloro 0.009565 0.008967 0.001417 0.003938 0.540813 0.030637 0.653386 0.049902 ... towfish_001 T20.28.05 30.0935 -158.0003 5 A 173 0 2 pro peak near low fluor noise
1 towfish_001.fcs prochloro 0.008924 0.015568 0.001417 0.003938 0.532844 0.029486 0.643588 0.047994 ... towfish_001 T20.28.05 30.0935 -158.0003 5 A 173 0 2 pro peak near low fluor noise
2 towfish_001.fcs prochloro 0.005311 0.011982 0.001417 0.003938 0.478654 0.022358 0.577122 0.036229 ... towfish_001 T20.28.05 30.0935 -158.0003 5 A 173 0 2 pro peak near low fluor noise
3 towfish_001.fcs picoeuk 0.925120 1.580796 0.003468 0.003938 1.548094 0.462018 2.003069 0.898166 ... towfish_001 T20.28.05 30.0935 -158.0003 5 A 173 0 2 pro peak near low fluor noise
4 towfish_001.fcs picoeuk 0.277725 0.207073 0.001417 0.003938 1.136778 0.208269 1.416857 0.367623 ... towfish_001 T20.28.05 30.0935 -158.0003 5 A 173 0 2 pro peak near low fluor noise

5 rows × 23 columns

files = list(pd.unique(underway_g3['filename']))
print(files)
['towfish_001.fcs', 'towfish_002.fcs', 'towfish_003.fcs', 'towfish_004.fcs', 'towfish_005.fcs', 'towfish_006.fcs', 'towfish_007.fcs', 'towfish_008.fcs', 'towfish_009.fcs', 'towfish_010.fcs', 'towfish_011.fcs', 'towfish_012.fcs', 'towfish_013.fcs', 'towfish_014.fcs', 'towfish_015.fcs', 'towfish_016.fcs', 'towfish_017.fcs', 'towfish_018.fcs', 'towfish_019.fcs', 'towfish_020.fcs', 'underway_002.fcs', 'underway_003.fcs', 'underway_004.fcs', 'underway_005.fcs', 'underway_006.fcs', 'underway_007.fcs', 'underway_008.fcs', 'underway_009.fcs', 'underway_010.fcs', 'underway_011.fcs', 'underway_012.fcs', 'underway_013.fcs', 'underway_014.fcs', 'underway_015.fcs', 'underway_017.fcs', 'underway_018.fcs', 'underway_019.fcs', 'underway_020.fcs', 'underway_021.fcs', 'underway_022.fcs', 'underway_023.fcs', 'underway_024.fcs', 'underway_025.fcs', 'underway_026.fcs', 'underway_027.fcs', 'underway_028.fcs', 'underway_029.fcs', 'underway_030.fcs', 'underway_031.fcs', 'underway_032.fcs', 'underway_033.fcs', 'underway_034.fcs', 'underway_035.fcs', 'underway_036.fcs', 'underway_037.fcs', 'underway_038.fcs', 'underway_039.fcs', 'underway_040.fcs', 'underway_041.fcs', 'underway_042.fcs', 'underway_043.fcs', 'underway_044.fcs', 'underway_045.fcs', 'underway_046.fcs', 'underway_047.fcs', 'underway_048.fcs', 'underway_049.fcs', 'underway_050.fcs', 'underway_051.fcs', 'underway_052.fcs', 'underway_053.fcs', 'underway_054.fcs', 'underway_055.fcs', 'underway_056.fcs', 'underway_057.fcs', 'underway_058.fcs', 'underway_059.fcs', 'underway_060.fcs', 'underway_061.fcs', 'underway_062.fcs', 'underway_063.fcs', 'underway_064.fcs', 'underway_065.fcs', 'underway_066.fcs', 'underway_067.fcs', 'underway_068.fcs', 'underway_069.fcs', 'underway_070.fcs', 'underway_071.fcs', 'underway_072.fcs', 'underway_073.fcs', 'underway_074.fcs', 'underway_075.fcs', 'underway_076.fcs', 'underway_077.fcs', 'underway_078.fcs', 'underway_079.fcs', 'underway_080.fcs', 'underway_081.fcs', 'underway_082.fcs', 'underway_083.fcs', 'underway_084.fcs', 'underway_085.fcs', 'underway_086.fcs', 'underway_087.fcs', 'underway_088.fcs']

Each file is 1 sample taken at a certain time, location, and depth. There are also replicates, or even triplicates, run at the same spatiotemporal point to get uncertainty estimates for the instrument. We can either ignore the replicates/triplicates or take their mean.
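As a hypothetical sketch of the second option (the grouping columns below are an assumption about what identifies a sample), one could summarize each replicate and then average those summaries:

# summarize the optical measurements per replicate, then average over replicates
numeric_cols = ['norm.scatter', 'norm.red', 'norm.orange', 'norm.green']
per_replicate = underway_g3.groupby(['label', 'depth', 'replicate'])[numeric_cols].mean()
per_sample = per_replicate.groupby(level=['label', 'depth']).mean()  # mean over replicates
per_sample.head()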

test1 = underway_g3[underway_g3['filename']==files[1]]
test1.head()
filename pop norm.scatter norm.red norm.orange norm.green diam_lwr Qc_lwr diam_mid Qc_mid ... file label lat lon depth replicate volume stain flag comments
4770 towfish_002.fcs prochloro 0.008020 0.019802 0.001535 0.004364 0.521334 0.027870 0.629445 0.045320 ... towfish_002 dusk 31.091 -158.0009 5 A 295 0 0 pro peak near low fluor noise
4771 towfish_002.fcs prochloro 0.007402 0.013791 0.001535 0.004364 0.512479 0.026665 0.618575 0.043329 ... towfish_002 dusk 31.091 -158.0009 5 A 295 0 0 pro peak near low fluor noise
4772 towfish_002.fcs prochloro 0.008020 0.014121 0.001535 0.004364 0.521334 0.027870 0.629445 0.045320 ... towfish_002 dusk 31.091 -158.0009 5 A 295 0 0 pro peak near low fluor noise
4773 towfish_002.fcs prochloro 0.008834 0.014904 0.001535 0.004364 0.531600 0.029308 0.642055 0.047700 ... towfish_002 dusk 31.091 -158.0009 5 A 295 0 0 pro peak near low fluor noise
4774 towfish_002.fcs prochloro 0.013423 0.005825 0.001535 0.004364 0.580362 0.036756 0.702153 0.060086 ... towfish_002 dusk 31.091 -158.0009 5 A 295 0 0 pro peak near low fluor noise

5 rows × 23 columns

px.scatter(test1, 'norm.scatter','norm.orange',color='pop',log_x=True,log_y=True)

Typically, populations are gated based on their log-transformed normalized scatter (x) and red fluorescence (y). Orange fluorescence can be used in addition to gate the synecho (Synechococcus) population. We can see there are 3 populations here, and these were previously gated from the parameters above.

import seaborn as sns
sns.set_theme(style="ticks")
# copy the slice to avoid pandas' SettingWithCopyWarning when modifying columns
df = test1[['norm.scatter','norm.red','norm.orange','norm.green','depth']].copy()
df['norm.scatter'] = np.log10(df['norm.scatter'])
df['norm.red'] = np.log10(df['norm.red'])
df['norm.orange'] = np.log10(df['norm.orange'])
sns.pairplot(df, hue="norm.green")
<seaborn.axisgrid.PairGrid at 0x32ce93610>
../_images/3.3_clustering_14_2.png
sns.set_theme(style="ticks")
df = test1[['norm.scatter','norm.red','norm.orange','pop']].copy()
df['norm.scatter'] = np.log10(df['norm.scatter'])
df['norm.red'] = np.log10(df['norm.red'])
df['norm.orange'] = np.log10(df['norm.orange'])
sns.pairplot(df, hue="pop")
<seaborn.axisgrid.PairGrid at 0x32cea10d0>
../_images/3.3_clustering_15_2.png

3. K-means#

K-means is an unsupervised clustering method. The main idea is to separate the data into K distinct clusters. We then have two problems to solve: first, find the k centroids of the k clusters; second, assign each data point to the cluster whose centroid is closest to it.

The goal is to partition \(n\) data points into \(k\) clusters. Each observation is labeled to a cluster with the nearest mean.

K-means is iterative:

  1. assume initial values for the mean of each of the \(k\) clusters

  2. Compute the distance for each observation to each of the \(k\) means

  3. Label each observation as belonging to the nearest mean

  4. Find the center of mass (mean) of each group of labeled points. These become the new means; return to step 2 and repeat until convergence.

In the following, we denote \(n\) the number of data points, and \(p\) the number of features for each data point.

K-means only works with the Euclidean distance metric. Its core concept is to use the Euclidean distance to measure and minimize the inertia, or within-cluster variation.

n = len(df)
p = 3 # number of features used for the clustering
print('We have {:d} data points, and each one has {:d} features'.format(n, p))
We have 18470 data points, and each one has 3 features

Let’s define a numpy array with the 3 features and all of the data

data = np.zeros(shape=(n,p))
data[:,0] = df['norm.scatter']
data[:,1] = df['norm.red']
data[:,2] = df['norm.orange']

Let us define a function to initialize the centroids of the clusters. We choose random points within the range of values taken by the data.

def init_centers(data, k):
    """
    """
    # Initialize centroids
    centers = np.zeros((k, np.shape(data)[1]))
    # Loop on k centers
    for i in range(0, k):
        # Generate p random values between 0 and 1
        dist = np.random.uniform(size=np.shape(data)[1])
        # Use the random values to generate a point within the range of values taken by the data
        centers[i, :] = np.min(data, axis=0) + (np.max(data, axis=0) - np.min(data, axis=0)) * dist
    return centers

To be able to assign each data point to the closest centroid, we need to define the distance between two data points. The most common distance is the Euclidean distance:

\(d(x,y) = \sqrt{\sum_{i = 1}^p (x_i - y_i)^2}\)

where \(x\) and \(y\) are two data observation points with \(p\) variables.

We then define a function to compute the distance between each data point and each centroid.

def compute_distance(data, centers, k):
    """
    """
    # Initialize distance
    distance = np.zeros((np.shape(data)[0], k))
    # Loop on n data points
    for i in range(0, np.shape(data)[0]):
        # Loop on k centroids
        for j in range(0, k):
            # Compute distance
            distance[i, j] = np.sqrt(np.sum(np.square(data[i, :] - centers[j, :])))
    return distance

We now define a function to assign each data point to the cluster whose centroid is closest to the point. We also define an objective function that is minimized until we reach convergence.

Our objective is to minimize the sum of the square of the distance between each point and the closest centroid:

\(obj = \sum_{j = 1}^k \sum_{i = 1}^{N_j} d(x^{(i)} , x^{(j)}) ^2\)

where \(x^{(i)}\) is the \(i^{th}\) point in the cluster \(j\), \(x^{(j)}\) is the centroid of the cluster \(j\), and \(N_j\) is the number of points in the cluster \(j\).

def compute_objective(distance, clusters):
    """
    """
    # Initialize objective
    objective = 0.0
    # Loop on n data points
    for i in range(0, np.shape(distance)[0]):
        # Add distance to the closest centroid
        objective = objective + distance[i, int(clusters[i])] ** 2.0
    return objective
def compute_clusters(distance):
    """
    """
    # Initialize clusters
    clusters = np.zeros(np.shape(distance)[0])
    # Loop on n data points
    for i in range(0, np.shape(distance)[0]):
        # Find closest centroid
        best = np.argmin(distance[i, :])
        # Assign data point to corresponding cluster
        clusters[i] = best
    return clusters

After all points are assigned to a cluster, compute the new location of each centroid. It is simply the mean of all the points assigned to that cluster:

For \(1 \leq j \leq k\), \(x_p^{(j)} = \frac{1}{N_j} \sum_{i = 1}^{N_j} x_p^{(i)}\)

def compute_centers(data, clusters, k):
    """
    """
    # Initialize centroids
    centers = np.zeros((k, np.shape(data)[1]))
    # Loop on clusters
    for i in range(0, k):
        # Select all data points in this cluster
        subdata = data[clusters == i, :]
        # If no data point in this cluster, generate randomly a new centroid
        if (np.shape(subdata)[0] == 0):
            centers[i, :] = init_centers(data, 1)
        else:
            # Compute the mean location of all data points in this cluster
            centers[i, :] = np.mean(subdata, axis=0)
    return centers

We can now code the K-means algorithm by assembling all these functions. We stop the computation when the objective function no longer decreases.

def my_kmeans(data, k):
    """
    """
    # Initialize centroids
    centers = init_centers(data, k)
    # Initialize objective function to square of the maximum distance between two data points times number of data points
    objective_old = np.shape(data)[0] * np.sum(np.square(np.max(data, axis=0) - np.min(data, axis=0)))
    # Initialize clusters
    clusters_old = np.zeros(np.shape(data)[0])
    # Start loop until convergence
    stop_alg = False
    while stop_alg == False:
        # Compute distance between data points and centroids
        distance = compute_distance(data, centers, k)
        # Get new clusters
        clusters_new = compute_clusters(distance)
        # get new value of objective function
        objective_new = compute_objective(distance, clusters_new)
        # If objective function stops decreasing, end loop
        if objective_new >= objective_old:
            return (clusters_old, objective_old, centers)
        else:
            # Update the locations of the centroids
            centers = compute_centers(data, clusters_new, k)
            objective_old = objective_new
            clusters_old = clusters_new

Run K-means with 7 clusters

k = 7
(clusters, objective, centers) = my_kmeans(data, k)
df["clusterID"]=clusters.astype('str')
fig = px.scatter_3d(df, x='norm.scatter', y='norm.red', z='norm.orange',
              color='clusterID')
fig.show()
# fig = plt.figure(figsize=(10, 10))
# ax = fig.add_subplot(111, projection='3d')
# ax.scatter(data[:, 1], data[:, 0], -data[:, 2], c=clusters)
# ax.scatter(centers[:, 1], centers[:, 0], -centers[:, 2], marker='o', s=300, c='black')
# ax.set_xlabel('norm.scatterer')
# ax.set_ylabel('norm.red')
# ax.set_ylabel('norm.orange')
# plt.title('Clusters for cytometer data')

K-means with scikit-learn#

We will now use the scikit-learn toolbox to run k-means. Follow that tutorial and apply it to your problem.

df
norm.scatter norm.red norm.orange pop clusterID
4770 -2.095846 -1.703296 -2.813916 prochloro 0.0
4771 -2.130636 -1.860400 -2.813916 prochloro 0.0
4772 -2.095846 -1.850147 -2.813916 prochloro 0.0
4773 -2.053854 -1.826709 -2.813916 prochloro 0.0
4774 -1.872152 -2.234668 -2.813916 prochloro 0.0
... ... ... ... ... ...
23235 -1.812582 -1.821948 -2.813916 prochloro 0.0
23236 -1.895101 -1.754993 -2.813916 prochloro 0.0
23237 -2.146871 -1.946094 -2.813916 prochloro 0.0
23238 -2.034872 -1.918567 -2.813916 prochloro 0.0
23239 -1.948812 -2.087634 -2.813916 prochloro 0.0

18470 rows × 5 columns

from sklearn.cluster import KMeans

X = data
Y = np.asarray(df['pop'])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(data)
kmeans.labels_
array([0, 0, 0, ..., 0, 0, 0], dtype=int32)
# explore different pipeline


estimators = [
    ("k_means_cyto_8", KMeans(n_clusters=8)),
    ("k_means_cyto_3", KMeans(n_clusters=3)),
    ("k_means_cyto_bad_init", KMeans(n_clusters=3, n_init=1, init="random")),
]

name,est=estimators[0]
est.fit(data)
labels = est.labels_

fig = plt.figure(0, figsize=(4, 3))

df["clusterID"]=labels.astype('str')
fig = px.scatter_3d(df, x='norm.scatter', y='norm.red', z='norm.orange',
              color='clusterID')
fig.show()
<Figure size 400x300 with 0 Axes>

Practical tips for k-means#

  1. How to evaluate the success of the clustering? How well separated are the clusters? There are many ways to evaluate the quality of the clusters. Scikit-learn has summarized and packaged these tools into the module sklearn.metrics. See documentation here. Usually, higher scores are better. Here is a summary (a short usage sketch follows this list):

    • If the data has ground-truth labels (e.g., population of the cyto data), we can use several metrics:

      • homogeneity (each cluster contains only members of a given class, metrics.homogeneity_score(true_label, clusterID)), completeness (all members of a given class are assigned to the same cluster, metrics.completeness_score), V-measure (2 x homogeneity x completeness / (homogeneity + completeness), metrics.v_measure_score), or all three at once with metrics.homogeneity_completeness_v_measure(true_label, clusterID).

      • the Fowlkes-Mallows index (FMI, metrics.fowlkes_mallows_score), which uses TP (True Positive), FP (False Positive), and FN (False Negative) pair counts. It is close to 0.0 for random cluster assignments and 1.0 for perfect label assignments.

    • If the data does not have ground truth label, you can use:

      • the silhouette coefficient (metrics.silhouette_score); a higher score is better. It quantifies how tight the data within the clusters are and how well separated the clusters are. More details on the implementation and visualization of the silhouette score here.

  2. The number of clusters \(k\) is a tunable parameter. To find the optimal number of clusters, we discuss below a few strategies.

  3. The optimization of k-means may have local minima. The result may therefore differ depending on the initialization of the clusters. It is recommended to:

    • repeat the random initialization, re-run k-means, and keep the best set of clusters (the one with the lowest final error);

    • choose K-means++ initialization scheme using the sklearn parameter init='k-means++'. The algorithm selects initial cluster centroids using sampling based on an empirical probability distribution of the points’ contribution to the overall inertia.

  4. The variance of some of the features may affect the results. It may be difficult to find clusters if some of the data features (axes) span much larger ranges than others. The data may need to be pre-processed (centered and scaled) or pre-conditioned (e.g., with PCA).
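As a brief illustration of these scores (a sketch that assumes the DataFrame df with its ground-truth pop column, the clusterID column, and the feature matrix data from the cells above):

from sklearn import metrics

true_label = np.asarray(df['pop'])        # ground-truth populations (gated labels)
pred_label = np.asarray(df['clusterID'])  # k-means cluster assignments from above

# label-based scores; the scikit-learn signature is (labels_true, labels_pred)
print(metrics.homogeneity_completeness_v_measure(true_label, pred_label))
print(metrics.fowlkes_mallows_score(true_label, pred_label))

# label-free score: silhouette on the feature matrix used for the clustering
print(metrics.silhouette_score(data, pred_label, metric='euclidean'))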

In the following, we take the example of a simple dataset. Old Faithful is a geyser in Yellowstone. The data show the duration of a geyser eruption against the waiting time until the next eruption.

faithful = pd.read_csv('faithful.csv')
data_faithful = faithful.to_numpy()
plt.plot(faithful.current, faithful.next, 'ko')
plt.xlabel('Eruption time in minutes')
plt.ylabel('Waiting time to next eruption')
plt.grid(True)
plt.title('Old Faithful')
Text(0.5, 1.0, 'Old Faithful')
../_images/3.3_clustering_40_1.png

Silhouette analysis

The silhouette coefficient and silhouette score are metrics used to assess the quality of clustering in unsupervised learning.

  1. Silhouette Coefficient: The silhouette coefficient for a data sample measures how similar it is to its own cluster (cohesion) compared to other clusters (separation). It is calculated for each data sample and ranges from -1 to 1. A high silhouette coefficient indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. Conversely, a low silhouette coefficient suggests that the object may be in the wrong cluster.

    The formula for the silhouette coefficient (s) for a single data point is given by:

    \( s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \) where:

    • \(a(i)\) is the average distance from the \(i\)-th data point to the other data points in the same cluster.

    • \(b(i)\) is the smallest average distance from the \(i\)-th data point to data points in a different cluster, minimized over clusters.

  2. Silhouette Score: The silhouette score is the average silhouette coefficient across all data samples in the dataset. It provides a global measure of how well-separated clusters are. The silhouette score ranges from -1 to 1 as well, where a high score indicates good clustering and a low score suggests overlapping or misclassified clusters.

    The formula for the silhouette score is given by:

\( S = \frac{\sum_{i=1}^{N} s(i)}{N} \) where:

  • \(N\) is the number of data points in the dataset.

Interpretation:

  • A silhouette coefficient close to 1 indicates that the data point is well-matched to its own cluster and poorly matched to neighboring clusters, signifying a robust and distinct cluster.

  • A silhouette coefficient close to -1 suggests that the data point is possibly misclassified, as it is better matched to a neighboring cluster.

  • A silhouette coefficient around 0 indicates overlapping clusters.

For the silhouette score:

  • A score close to 1 implies well-defined clusters.

  • A score around 0 suggests overlapping clusters.

  • A negative score indicates that the majority of data points may be assigned to the wrong clusters.

In summary, silhouette analysis helps in selecting the optimal number of clusters and assessing the overall quality of clustering in unsupervised learning, providing valuable insights into the structure of the data.

# example of the silhouette score for the Old Faithful data
from sklearn.cluster import KMeans
from sklearn import metrics
from sklearn.metrics import silhouette_score, silhouette_samples

ncluster=3

kmeans_model = KMeans(n_clusters=ncluster, random_state=1, n_init=10).fit(data_faithful)
labels = kmeans_model.labels_
sc=silhouette_score(data_faithful, labels, metric='euclidean')
import matplotlib.cm as cm
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
#The 1st subplot is the silhouette plot
# The silhouette coefficient can range from -1, 1 but in this example all
# lie within [-0.1, 1]
ax1.set_xlim([-0.1, 1])
# The (n_clusters+1)*10 is for inserting blank space between silhouette
# plots of individual clusters, to demarcate them clearly.
ax1.set_ylim([0, len(data_faithful) + (ncluster + 1) * 10])

ncluster = 2
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
clusterer = KMeans(n_clusters=ncluster, random_state=10, n_init=10)
cluster_labels = clusterer.fit_predict(data_faithful)

# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(data_faithful, cluster_labels)
print(
    "For n_clusters =",
    ncluster,
    "The average silhouette_score is :",
    silhouette_avg,
)

# Compute the silhouette scores for each sample
sample_silhouette_values = silhouette_samples(data_faithful, cluster_labels)

y_lower = 10
for i in range(ncluster):
    # Aggregate the silhouette scores for samples belonging to
    # cluster i, and sort them
    ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

    ith_cluster_silhouette_values.sort()

    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i

    color = cm.nipy_spectral(float(i) / ncluster)
    ax1.fill_betweenx(
        np.arange(y_lower, y_upper),
        0,
        ith_cluster_silhouette_values,
        facecolor=color,
        edgecolor=color,
        alpha=0.7,
    )

    # Label the silhouette plots with their cluster numbers at the middle
    ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

    # Compute the new y_lower for next plot
    y_lower = y_upper + 10  # 10 for the 0 samples

ax1.set_title("The silhouette plot for the various clusters.")
ax1.set_xlabel("The silhouette coefficient values")
ax1.set_ylabel("Cluster label")

# The vertical line for average silhouette score of all the values
ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

ax1.set_yticks([])  # Clear the yaxis labels / ticks
ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

# 2nd Plot showing the actual clusters formed
colors = cm.nipy_spectral(cluster_labels.astype(float) / ncluster)
ax2.scatter(
    data_faithful[:, 0], data_faithful[:, 1], marker="o", s=30, lw=0, alpha=0.7, c=colors, edgecolor="k"
)

# Labeling the clusters
centers = clusterer.cluster_centers_
# Draw white circles at cluster centers
ax2.scatter(
    centers[:, 0],
    centers[:, 1],
    marker="o",
    c="white",
    alpha=1,
    s=200,
    edgecolor="k",
)

for i, c in enumerate(centers):
    ax2.scatter(c[0], c[1], marker="$%d$" % i, alpha=1, s=200, edgecolor=None)

ax2.set_title("The visualization of the clustered data.")
ax2.set_xlabel("Feature space for the 1st feature")
ax2.set_ylabel("Feature space for the 2nd feature")

plt.suptitle(
    "Silhouette analysis for KMeans clustering on sample data with n_clusters = %d"
    % ncluster,
    fontsize=14,
    fontweight="bold",
)
For n_clusters = 2 The average silhouette_score is : 0.560931744795642
Text(0.5, 0.98, 'Silhouette analysis for KMeans clustering on sample data with n_clusters = 2')
../_images/3.3_clustering_43_3.png

Choice of the number of clusters: The Elbow Method#

The elbow method is designed to find the optimal number of clusters. It consists of performing the clustering algorithm with an increasing number of clusters \(k\) and measuring the average distance between data points and the cluster centroids. There are two typical metrics in the Elbow method:

  • Distortion: It is calculated as the average of the squared distances from the cluster centers of the respective clusters. Typically, the Euclidean distance metric is used.

  • Inertia: It is the sum of squared distances of samples to their closest cluster center.

For each value of \(k\), we compute the mean of the squared distance between the data points and the centroid of the cluster to which they belong. We then plot this value as a function of \(k\). Hopefully, it decreases quickly and then reaches a plateau. The optimal number of clusters is the value at the ‘elbow’, where adding more clusters no longer substantially decreases the criterion.

Let us use a different dataset to illustrate the elbow method.

def compute_elbow(data, clusters, centers, k):
    """
    """
    E = 0
    for i in range(0, k):
        distance = compute_distance(data[clusters == i, :], centers[i, :].reshape(1, -1), 1)
        E = E + np.mean(np.square(distance))
    return E

Compute the value of E for different values of the number of clusters

E = np.zeros(8)
for k in range(1, 9):
    (clusters, objective, centers) = my_kmeans(data_faithful, k)
    E[k - 1] = compute_elbow(data_faithful, clusters, centers, k)

plt.figure(1, figsize=(3, 3))
plt.plot(np.arange(1, 9), E)
plt.xlabel('Number of clusters')
plt.ylabel('Elbow criterion')
plt.grid(True)
../_images/3.3_clustering_47_0.png

Plot \(E\) as a function of \(k\) and look for the ‘elbow’, where the curve bends and starts to flatten.

plt.plot(np.arange(1, 9), E)
plt.xlabel('Number of clusters')
plt.ylabel('Elbow criterion')
plt.grid(True)
../_images/3.3_clustering_49_0.png
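For reference, a minimal sketch of the same curve with scikit-learn: the inertia (sum of squared distances to the closest centroid) is available directly as the inertia_ attribute of a fitted KMeans object.

from sklearn.cluster import KMeans

inertia = []
for n_clus in range(1, 9):
    km = KMeans(n_clusters=n_clus, n_init=10, random_state=0).fit(data_faithful)
    inertia.append(km.inertia_)  # sum of squared distances to the closest centroid

plt.plot(np.arange(1, 9), inertia, 'o-')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid(True)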

The elbow method does not always work very well. For example, see what happens when the points get closer to each other.

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
alpha = 0.5
origin = np.array([3, 3])
data_shrink = origin + alpha * np.sign(data_faithful - origin) * np.power(np.abs(data_faithful - origin), 2.0)
ax1.plot(data_shrink[:, 0], data_shrink[:, 1], 'ko')

E = np.zeros(8)
for k in range(1, 9):
    (clusters, objective, centers) = my_kmeans(data_shrink, k)
    E[k - 1] = compute_elbow(data_shrink, clusters, centers, k)
ax2.plot(np.arange(1, 9), E)
[<matplotlib.lines.Line2D at 0x32dec2610>]
../_images/3.3_clustering_51_1.png

Let us see what happens when we decrease the number of data points.

fig, (ax1, ax2) = plt.subplots(1, 2)
fig.set_size_inches(18, 7)
alpha = 0.2
indices = np.random.uniform(size=np.shape(data_faithful)[0])
subdata = data_faithful[indices < alpha, :]
ax1.plot(subdata[:, 0], subdata[:, 1], 'ko')
E = np.zeros(8)
for k in range(1, 9):
    (clusters, objective, centers) = my_kmeans(subdata, k)
    E[k - 1] = compute_elbow(subdata, clusters, centers, k)
ax2.plot(np.arange(1, 9), E)
[<matplotlib.lines.Line2D at 0x32e047bb0>]
../_images/3.3_clustering_53_1.png

Repeat K-means#

The result is very sensitive to the location of the initial centroids. Repeat the clustering N times and choose the run with the best (lowest) objective function.

def repeat_kmeans(data, k, N):
    """
    """
    # Initialization
    objective = np.zeros(N)
    clusters = np.zeros((N, np.shape(data)[0]))
    centers = np.zeros((N, k, np.shape(data)[1]))
    # Run K-means N times
    for i in range(0, N):
        result = my_kmeans(data, k)
        clusters[i, :] = result[0]
        objective[i] = result[1]
        centers[i, :, :] = result[2]
        print(i/N*100)
    # Choose the clustering with the best value of the objective function
    best = np.argmin(objective)
    return (clusters[best, :], objective[best], centers[best, :, :])

Repeat k-means 30 times

N = 30
(clusters, objective, centers) = repeat_kmeans(data, k, N)
fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(data[:, 1], data[:, 2], -data[:, 0], c=clusters)
ax.scatter(centers[:, 1], centers[:, 2], -centers[:, 0], marker='o', s=300, c='black')
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x32e1046a0>
../_images/3.3_clustering_57_2.png

K-means can be slow to compute, and there are avenues to speed it up. One solution is mini-batch K-means, which takes a random subset of the data at each iteration to update the centroids.
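As a sketch of that idea with scikit-learn's MiniBatchKMeans (the number of clusters and batch_size below are illustrative choices, applied to the cytometer feature matrix data):

from sklearn.cluster import MiniBatchKMeans

# each iteration updates the centroids using a random mini-batch of the data,
# which is typically much faster than full K-means on large datasets
mbk = MiniBatchKMeans(n_clusters=3, batch_size=256, n_init=10, random_state=0)
mbk_labels = mbk.fit_predict(data)
print(mbk.cluster_centers_)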

4. Hierarchical Clustering#

In K-means, we use the Euclidean distance and prescribe the number of clusters K.

In hierarchical clustering, we can choose different distance metrics, visualize the data structure, and then decide on the number of clusters. There are two approaches to building the hierarchy of clusters:

  • Agglomerative: each point starts in its own cluster. Clusters are merged in pairs as one builds the hierarchy from the bottom up.

  • Divisive: initially, all data belongs to one cluster. The data is recursively split into smaller and smaller clusters.

There are several types of linkage; sklearn has detailed documentation, mostly for the agglomerative approach (a short usage sketch follows this list). The different linkage methods are:

  • Ward minimizes the sum of squared differences within all clusters. It is a variance-minimizing approach and in this sense is similar to the k-means objective function but tackled with an agglomerative hierarchical approach.

  • Maximum or complete linkage minimizes the maximum distance between observations of pairs of clusters.

  • Average linkage minimizes the average of the distances between all observations of pairs of clusters.

  • Single linkage minimizes the distance between the closest observations of pairs of clusters.
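As a minimal, self-contained sketch of how the linkage choice enters the scikit-learn API (the two-blob toy data below is arbitrary; the real example follows):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(3, 0.5, size=(50, 2))])

for linkage in ["ward", "complete", "average", "single"]:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    labels = model.fit_predict(toy)
    print(linkage, np.bincount(labels))  # cluster sizes under each linkage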

We first import relevant packages

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import rcParams
from scipy.cluster import hierarchy  #
from scipy.spatial.distance import pdist

rcParams.update({'font.size': 18})
plt.rcParams['figure.figsize'] = [12, 12]

Here, we create a fake dataset that has 2 clusters that intermingle at a few data points.

# Training and testing set sizes
n = 100 # Train

# Random ellipse 1 centered at (0,0)
x = np.random.randn(n)
y = 0.5*np.random.randn(n)

# Random ellipse 2 centered at (1,-2)
x2 = np.random.randn(n) + 1
y2 = 0.2*np.random.randn(n) - 2

# Rotate ellipse 2 by theta
theta = np.pi/4
A = np.zeros((2,2))
A[0,0] = np.cos(theta)
A[0,1] = -np.sin(theta)
A[1,0] = np.sin(theta)
A[1,1] = np.cos(theta)

x3 = A[0,0]*x2 + A[0,1]*y2
y3 = A[1,0]*x2 + A[1,1]*y2
plt.figure()
plt.plot(x[:],y[:],'ro')
plt.plot(x3[:],y3[:],'bo')
plt.show()
../_images/3.3_clustering_63_0.png

Combine these two data sets as one.

X1 = np.column_stack((x3[:],y3[:]))
X2 = np.column_stack((x[:],y[:]))
X = np.concatenate((X1,X2))

First we explore the dendrograms.

## Dendrograms
Y = pdist(X,metric='euclidean')
Z = hierarchy.linkage(Y,method='average')
thresh = 0.85*np.max(Z[:,2])

plt.figure()
dn = hierarchy.dendrogram(Z,p=100,color_threshold=thresh)
plt.xlabel('Data Sample Index')
plt.ylabel('Distance')
plt.title('Dendrogram with average linkage')
plt.show()
../_images/3.3_clustering_67_0.png
thresh = 0.25*np.max(Z[:,2])
plt.figure()
dn = hierarchy.dendrogram(Z,p=100,color_threshold=thresh)
plt.xlabel('Data Sample Index')
plt.ylabel('Distance')
plt.title('Dendrogram with average linkage')
plt.show()
../_images/3.3_clustering_68_0.png

We have now explored the structure of the data and built some intuition about the number of clusters and how the data samples are distributed among them.

Next, we choose a distance threshold and assign each data point to a cluster ID.

In scikit-learn, the entire algorithm is implemented in the function AgglomerativeClustering (sklearn doc here). The function still needs either a distance threshold or a number of clusters to perform the clustering and assign a cluster ID to each data sample.


from sklearn.cluster import AgglomerativeClustering
# Let's first find a reasonable distance threshold by precalculating the linkage matrix
Z = hierarchy.linkage(X, "average") 
thresh = 0.85*np.max(Z[:,2])    # choose a threshold distance
print(thresh)
# design model
model = AgglomerativeClustering(distance_threshold=thresh,linkage="average", n_clusters=None)
# fit model and predict clusters on the data samples
clusterID=model.fit_predict(X)
3.819593473920317
plt.scatter(X[:,0],X[:,1],c=clusterID)
<matplotlib.collections.PathCollection at 0x32e7f9f70>
../_images/3.3_clustering_73_1.png

Geosciences Applications

In the following exercise, we will try to find clusters from features of a set of seismic waveforms. The waveforms come from Mt Hood, an ice-capped volcano in the Cascades that experiences seismic events from volcanic activity (magmatic and hydrothermal), the tectonic context, and glacier/snow-related processes.

PCA before clustering#

Let us generate synthetic data.

centers = np.array([[2, 2], [2, 8], [4, 3]])
radius = [0.1, 1]
synthetics = np.empty([0, 2])
for i in range(0, 3):
    X = centers[i, 0] + radius[0] * np.random.randn(100)
    Y = centers[i, 1] + radius[1] * np.random.randn(100)
    U = (X + Y) * np.sqrt(2) / 2   # rotate by 45 degrees (np.sqrt replaces the missing math.sqrt import)
    V = (X - Y) * np.sqrt(2) / 2
    synthetics = np.concatenate([synthetics, np.vstack((U, V)).T])
plt.plot(synthetics[:,0], synthetics[:,1], 'ko')
plt.xlim(1, 9)
plt.ylim(-6, 3)

Let us now do k-means clustering with 3 clusters.

random.seed(0)
(clusters, objective, centers) = my_kmeans(synthetics, 3)
plt.scatter(synthetics[:,0], synthetics[:,1], c=clusters)
<matplotlib.collections.PathCollection at 0x2bdd92640>
../_images/3.3_clustering_79_1.png

What happens if we apply PCA + normalization before the clustering?

pca = PCA(n_components=2)
synthetics_pca = pca.fit_transform(synthetics)
scaler = preprocessing.StandardScaler().fit(synthetics_pca)
synthetics_scaled = scaler.transform(synthetics_pca)
(clusters, objective, centers) = my_kmeans(synthetics_scaled, 3)
plt.scatter(synthetics[:,0], synthetics[:,1], c=clusters)
<matplotlib.collections.PathCollection at 0x2bde1d5e0>
../_images/3.3_clustering_83_1.png