Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics. [Wikipedia: https://en.wikipedia.org/wiki/Cluster_analysis]
A nice introduction to clustering in Python, and why it is a good option for EDA (exploratory data analysis), can be found here: http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html This page explains why Python is efficient for this task and gives a general overview of the different algorithms.
The next step is to see which algorithms are implemented in the scikit-learn (sklearn) Python module, and what the strengths and drawbacks of each of them are, illustrated with examples: http://scikit-learn.org/stable/modules/clustering.html#dbscan
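Most of the clustering estimators in scikit-learn share the same fit / labels_ interface, which makes it easy to swap one algorithm for another. Below is a minimal sketch of that interface on synthetic data from make_blobs (not the SCIDAR data used in the rest of this notebook); the parameter values are only illustrative.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN, AffinityPropagation
# Synthetic 2D data with 4 well-separated blobs, purely for illustration
X_demo, _ = make_blobs(n_samples=300, centers=4, random_state=0)
for algo in [KMeans(n_clusters=4, n_init=10),
             DBSCAN(eps=0.8, min_samples=5),
             AffinityPropagation()]:
    algo.fit(X_demo)              # same call for every estimator
    labels_demo = algo.labels_    # cluster index of each sample (-1 means noise for DBSCAN)
    print(type(algo).__name__, 'found', len(set(labels_demo) - {-1}), 'clusters')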
import sys,os
from astropy.io import ascii,fits
from astropy.time import Time
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
%matplotlib inline
import numpy as np
from datetime import timedelta #datetime
from scipy.interpolate import interp1d
import seaborn as sns
import numpy.ma as ma
from statsmodels import robust
from time import time
from sklearn.decomposition import PCA
from sklearn import metrics
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
import pandas as pd
# pd.set_option('display.mpl_style', 'default') was removed in recent pandas versions
plt.style.use('ggplot') # make the graphs a bit prettier
path_root = '/Users/jmilli/Documents/atmospheric_parameters/SCIDAR'
The example used in this python coffee is based on SCIDAR data. The SCIDAR is a turbulence profiler located at Paranal and in operation since mid-2016. I load here the turbulence profiles from the few runs available since the start of operations (a sparse data set), stored as a csv file. For that I use the pandas module.
Cn2_pd = pd.read_csv(os.path.join(path_root,'Cn2_interpolated.csv'),
                     parse_dates=[0],dayfirst=True, index_col=0)#, sep=' ', encoding='latin1'
Let's have a look at the table:
Cn2_pd.head()
Cn2_pd.shape
We have 2878 turbulence profiles, and for each profile we have the strength of the turbulence (expressed as a $Cn^2$ in $m^{1/3}$) for 116 altitude layers.
Cn2_pd['00000.00m'].plot()
plt.ylabel('Ground layer turbulence')
The data are not evenly spread in time: we have 4 SCIDAR runs, and for each run a few hundred profiles were measured over a few nights.
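A quick way to check this (a sketch, assuming the index was correctly parsed as dates by read_csv above) is to count the number of profiles per calendar day:
# Number of profiles recorded per calendar day (sketch of a quick sanity check)
profiles_per_day = Cn2_pd.groupby(Cn2_pd.index.date).size()
print(profiles_per_day)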
The column labels are strings representing the altitude in m. I convert these string labels into a numeric array.
altitude = np.asarray([float(str_alt[0:-1]) for str_alt in Cn2_pd.columns])
print(altitude)
print(len(altitude))
Let's have a look at some profiles
for i in range(0,2870,300):
    plt.semilogy(altitude,Cn2_pd.iloc[i, 0:116],label='{0:02d}'.format(i))
plt.legend(frameon=False)
plt.xlabel('Altitude in m')
plt.ylabel('$Cn^2$ in $m^{1/3}$')
ncomp = 10
pca = PCA(n_components=ncomp)
PC = pca.fit_transform(Cn2_pd)
PC= np.asarray(PC)
plt.plot(pca.explained_variance_ratio_*100)
plt.ylabel('Explained variance in %')
plt.xlabel('principal_component')
plt.savefig(os.path.join(path_root,'explained_variance.pdf'))
#print(pca.explained_variance_ratio_)
nb_pc = 2
print('Sum of the explained variance for the first {0:d} PC: {1:3.1f}'.format(nb_pc,100*np.sum(pca.explained_variance_ratio_[0:nb_pc])))
#print(pca.explained_variance_)
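As a sanity check (a sketch, not part of the original analysis), the profiles can be reconstructed from the first nb_pc components only, by zeroing the higher-order components before calling pca.inverse_transform:
# Sketch: reconstruct the profiles from the first nb_pc principal components only
PC_truncated = PC.copy()
PC_truncated[:, nb_pc:] = 0
Cn2_reconstructed = pca.inverse_transform(PC_truncated)
residual = np.asarray(Cn2_pd) - Cn2_reconstructed
print('RMS residual of the {0:d}-PC reconstruction: {1:.2e}'.format(nb_pc, np.sqrt(np.mean(residual**2))))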
plt.plot(PC[:,0],PC[:,1], 'k.', markersize=2)
plt.xlabel('First principal component ($m^{1/3}$)')
plt.ylabel('Second principal component ($m^{1/3}$)')
Now let's use the K-means clustering technique. Say we want to group these data into 5 clusters. Applying the K-means algorithm takes 3 lines of code.
n_clusters = 5
kmeans = KMeans(init='k-means++', n_clusters=n_clusters, n_init=5)
kmeans.fit(PC[:,0:2])
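The choice of 5 clusters is somewhat arbitrary. A common way to guide it (sketched below, using the metrics module imported above) is to compare the inertia and the silhouette score over a range of cluster numbers:
# Sketch: inertia (elbow criterion) and silhouette score for several cluster numbers
for k in range(2, 9):
    km = KMeans(init='k-means++', n_clusters=k, n_init=5).fit(PC[:, 0:2])
    sil = metrics.silhouette_score(PC[:, 0:2], km.labels_)
    print('k={0:d} inertia={1:.2e} silhouette={2:.2f}'.format(k, km.inertia_, sil))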
The rest is purely for plotting purposes and you can skip it
x_min, x_max = PC[:, 0].min() , PC[:, 0].max()
y_min, y_max = PC[:, 1].min() , PC[:, 1].max()
dx = (x_max-x_min)/100
print(dx)
Above, I just computed a typical cell size in the PC1 vs PC2 space, to choose the step h of the mesh used below.
h = 4e-17 # step size of the mesh covering [x_min, x_max] x [y_min, y_max].
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Obtain labels for each point in mesh. Use last trained model.
Z = kmeans.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.figure(1)
plt.clf()
plt.imshow(Z, interpolation='nearest',
           extent=(xx.min(), xx.max(), yy.min(), yy.max()),
           cmap=plt.cm.Paired,
           aspect='auto', origin='lower')
plt.plot(PC[:, 0], PC[:, 1], 'k.', markersize=2)
# Plot the centroids as a white X
centroids = kmeans.cluster_centers_
plt.scatter(centroids[:, 0], centroids[:, 1],
            marker='x', s=100, linewidths=2,
            color='w', zorder=10)
plt.title('K-means clustering on the principal components\n'
          'Centroids are marked with a white cross')
for i in range(n_clusters):
    plt.text(centroids[i, 0], centroids[i, 1]+2*h,'{0:d}'.format(i),color='w')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.xticks(())
plt.yticks(())
The white crosses mark the center of each cluster. Let's now plot the mean profile of each cluster.
mean_profile = np.ndarray((n_clusters,Cn2_pd.shape[1]))
freq = np.ndarray((n_clusters))
for i in range(n_clusters):
    id_cluster = kmeans.labels_ == i
    freq[i] = np.sum(id_cluster)*1./Cn2_pd.shape[0]
    mean_profile[i,:] = np.mean(Cn2_pd.iloc[id_cluster,:],axis=0)
for i in range(n_clusters):
    plt.semilogy(altitude,mean_profile[i,:],label='Cluster {0:d}: {1:2.1f}%'.format(i,freq[i]*100))
plt.legend(frameon=False)
plt.xlabel('Altitude in m')
plt.ylabel('$Cn^2$ in $m^{1/3}$')
Here we have applied K-means after a PCA, i.e. in a 2D space. We could also apply it directly to our 116-dimensional space.
n_samples, n_features = Cn2_pd.shape
n_clusters = 5
kmeans2 = KMeans(init='k-means++', n_clusters=n_clusters, n_init=5)
kmeans2.fit(Cn2_pd)
As you can see this is very fast. However, we can no longer represent those clusters visually in a simple plot.
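We can nevertheless compare this clustering with the one obtained on the first two principal components, for instance with the adjusted Rand index (a sketch; the exact value depends on the random initialisations):
# Sketch: agreement between the clustering in the 2D PCA space and in the full 116D space
# (adjusted Rand index: 1 for identical partitions, close to 0 for random ones)
ari = metrics.adjusted_rand_score(kmeans.labels_, kmeans2.labels_)
print('Adjusted Rand index between the 2D and 116D clusterings: {0:.2f}'.format(ari))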
mean_profile_2 = np.ndarray((n_clusters,Cn2_pd.shape[1]))
freq_2 = np.ndarray((n_clusters))
for i in range(n_clusters):
    id_cluster = kmeans2.labels_ == i
    freq_2[i] = np.sum(id_cluster)*1./Cn2_pd.shape[0]
    mean_profile_2[i,:] = np.mean(Cn2_pd.iloc[id_cluster,:],axis=0)
for i in range(n_clusters):
    plt.semilogy(altitude,mean_profile_2[i,:],label='Cluster {0:d}: {1:2.1f}%'.format(i,freq_2[i]*100))
plt.legend(frameon=False)
plt.xlabel('Altitude in m')
plt.ylabel('$Cn^2$ in $m^{1/3}$')
Affinity propagation is another technique, based on message passing on a graph. One strength is that you don't have to define the number of clusters: the algorithm determines it automatically from the data.
from sklearn.cluster import AffinityPropagation
af = AffinityPropagation().fit(PC[:,0:2])
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)
print('Estimated number of clusters: {0:d}'.format(n_clusters_))
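On a data set of this size affinity propagation can return a very large number of clusters. Its behaviour can be tuned through the damping and preference parameters; the values in the sketch below are only illustrative and not tuned for the SCIDAR data:
# Sketch: the number of clusters can be controlled indirectly through `damping` and
# `preference` (a lower preference favours fewer clusters); values here are illustrative only
af_tuned = AffinityPropagation(damping=0.9, preference=-50).fit(PC[:,0:2])
print('Estimated number of clusters with tuned parameters: {0:d}'.format(len(af_tuned.cluster_centers_indices_)))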
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.1, min_samples=10).fit(PC[:,0:2])
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print('Estimated number of clusters: %d' % n_clusters_)
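The eps parameter is the critical one for DBSCAN. A common heuristic to check whether a value like eps=0.1 is sensible (a sketch, using scikit-learn's NearestNeighbors) is the k-distance plot: sort the distance of each point to its k-th nearest neighbour and look for the knee of the curve:
# Sketch: k-distance plot to guide the choice of eps (k = min_samples = 10)
from sklearn.neighbors import NearestNeighbors
nn = NearestNeighbors(n_neighbors=10).fit(PC[:,0:2])
distances, _ = nn.kneighbors(PC[:,0:2])  # distances to the 10 nearest neighbours (self included)
plt.plot(np.sort(distances[:, -1]))
plt.xlabel('Point index (sorted)')
plt.ylabel('Distance to the 10th neighbour')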
unique_labels = set(labels)
colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))
data_db = PC[:,0:2]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels == k)
    xy = data_db[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=1)
    xy = data_db[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=10)
plt.title('Estimated number of clusters: %d' % n_clusters_)
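To put a number on the quality of this partition (a sketch, only meaningful if DBSCAN found at least two clusters), the silhouette score can be computed on the non-noise points:
# Sketch: silhouette score of the DBSCAN partition, computed on the non-noise points only
mask = labels != -1
if len(set(labels[mask])) > 1:
    print('Silhouette score (noise excluded): {0:.2f}'.format(
        metrics.silhouette_score(data_db[mask], labels[mask])))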