{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Clustering with sklearn" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.\n", "[Wikipedia: https://en.wikipedia.org/wiki/Cluster_analysis]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A nice introduction to clustering in Python, and to why it is a good option for EDA (exploratory data analysis), can be found here: http://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html\n", "This blog post explains why Python is efficient for this task and gives a general overview of the different algorithms." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The next step is to see which algorithms are implemented in the sklearn Python module, along with the strengths and drawbacks of each, illustrated through examples: http://scikit-learn.org/stable/modules/clustering.html#dbscan" ] }, { "cell_type": "raw", "metadata": {}, "source": [ "I propose here an example based on my own research, adding one more tool to those presented on the websites above: principal component analysis." 
] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import sys,os\n", "from astropy.io import ascii,fits\n", "from astropy.time import Time\n", "import matplotlib.pyplot as plt\n", "import matplotlib.gridspec as gridspec \n", "%matplotlib inline\n", "import numpy as np\n", "from datetime import timedelta #datetime\n", "from scipy.interpolate import interp1d\n", "import seaborn as sns\n", "import numpy.ma as ma\n", "from statsmodels import robust\n", "\n", "from time import time\n", "from sklearn.decomposition import PCA\n", "from sklearn import metrics\n", "from sklearn.cluster import KMeans\n", "from sklearn.datasets import load_digits\n", "from sklearn.preprocessing import scale\n", "\n", "import pandas as pd\n", "# The pandas option 'display.mpl_style' is deprecated; use a matplotlib style sheet instead\n", "plt.style.use('ggplot') # Make the graphs a bit prettier\n", "\n", "path_root = '/Users/jmilli/Documents/atmospheric_parameters/SCIDAR'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The example used in this Python coffee is based on SCIDAR data. The SCIDAR is a turbulence profiler at Paranal that has been in operation since mid-2016. \n", "Here I import the turbulence profiles of the few available runs since the start of operations (a sparse data set), stored as a CSV file. For that I use the pandas module." 
] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [], "source": [ "Cn2_pd = pd.read_csv(os.path.join(path_root,'Cn2_interpolated.csv'), \n", " parse_dates=[0],dayfirst=True, index_col=0)#, sep=' ', encoding='latin1'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's have a look at the table:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
| Date | 00000.00m | 00262.21m | 00524.41m | 00786.62m | 01048.83m | 01311.03m | 01573.24m | 01835.44m | 02097.65m | 02359.86m | ... | 27793.88m | 28056.08m | 28318.29m | 28580.50m | 28842.70m | 29104.91m | 29367.12m | 29629.32m | 29891.53m | 30153.74m |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2016-04-29 23:37:19 | 4.390000e-16 | 0.000000e+00 | 3.980000e-18 | 1.200000e-17 | 4.130000e-18 | 1.070000e-17 | 1.050000e-17 | 1.030000e-17 | 1.500000e-17 | 1.550000e-17 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2016-04-29 23:38:44 | 7.710000e-16 | 0.000000e+00 | 1.710000e-18 | 9.630000e-18 | 9.540000e-18 | 4.370000e-18 | 1.180000e-17 | 5.970000e-18 | 1.200000e-17 | 1.740000e-17 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2016-04-29 23:40:08 | 6.800000e-16 | 0.000000e+00 | 2.640000e-18 | 8.300000e-18 | 3.140000e-18 | 9.290000e-18 | 1.120000e-17 | 9.110000e-18 | 1.090000e-17 | 1.580000e-17 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2016-04-29 23:42:11 | 3.500000e-16 | 2.180000e-16 | 0.000000e+00 | 4.200000e-18 | 1.560000e-17 | 1.500000e-17 | 7.910000e-18 | 6.380000e-18 | 1.160000e-17 | 1.490000e-17 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |
| 2016-04-29 23:43:41 | 2.630000e-16 | 8.660000e-18 | 1.830000e-17 | 1.190000e-17 | 5.420000e-18 | 1.220000e-17 | 4.710000e-18 | 5.020000e-18 | 8.140000e-18 | 8.210000e-18 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |

5 rows × 116 columns
\n", "