Homework Classic Machine Learning (50 points)#
In this homework, we will try to classify observations of space as stars, galaxies, or quasars based on Data Release 14 (DR14) of the Sloan Digital Sky Survey (SDSS). The Sloan Digital Sky Survey is a project that offers public data of space observations. Observations have been made since 1998 and are accessible to everyone who is interested.
For this purpose, a special 2.5 m diameter telescope was built at the Apache Point Observatory in New Mexico, USA. The telescope uses a camera of 30 CCD chips with 2048x2048 image points each. The chips are arranged in 5 rows of 6 chips each. Each row observes the sky through a different optical filter (u, g, r, i, z) at wavelengths of approximately 354, 476, 628, 769, and 925 nm.
In this homework we will train several classifiers to predict the class of a celestial object based on the observations (features). We will practice data preparation, dimensionality reduction, model design and training, model comparison, and feature importance analysis.
Importing Libraries#
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
%matplotlib inline
1. Data Preparation (20 points)#
We will follow these steps:
read (1 point)
clean (3 points)
correlate (4 points)
explore, spread of values (3 points)
dimensionality reduction (9 points)
!wget "https://raw.githubusercontent.com/UW-MLGEO/MLGeo-dataset/refs/heads/main/data/Skyserver_SQL2_27_2018 6_51_39 PM.csv"
1.1 Data read#
Read the pandas data frame from the csv file "Skyserver_SQL2_27_2018 6_51_39 PM.csv" and skip the first row.
Task: read pandas data frame (1 point)
Save a copy of the data frame just in case.
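A minimal sketch (assuming the dataframe is called sdss_df, the name used in later hints, and that the csv file sits in the working directory):
import pandas as pd
sdss_df = pd.read_csv("Skyserver_SQL2_27_2018 6_51_39 PM.csv", skiprows=1)  # skip the first row
sdss_df_backup = sdss_df.copy()  # keep an untouched copy, just in case
sdss_df.head()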
Description of the data fields
objid = Object Identifier, self explanatory.
ra = J2000 Right Ascension (r-band). Angle that is measured eastward along the celestial equator from the Sun at the March equinox to the hour circle of the point in question. https://en.wikipedia.org/wiki/Right_ascension
dec = J2000 Declination (r-band). Angle that is measured north or south of the celestial equator, along the hour circle passing through the point in question. https://en.wikipedia.org/wiki/Declination
The Gunn astronomic magnitude system. u, g, r, i, z represent the response of the 5 bands of the telescope.
Further Information: http://astroweb.case.edu/ssm/ASTR620/mags.html
u = better of DeV/Exp magnitude fit
g = better of DeV/Exp magnitude fit
r = better of DeV/Exp magnitude fit
i = better of DeV/Exp magnitude fit
z = better of DeV/Exp magnitude fit
Run, rerun, camcol and field are features which describe a field within an image taken by the SDSS. A field is basically a part of the entire image corresponding to 2048 by 1489 pixels.
run = Run Number, which identifies the specific scan.
rerun = Rerun Number, which specifies how the image was processed.
camcol = Camera column, a number from 1 to 6, identifying the scanline within the run.
field = Field number, which typically starts at 11 (after an initial rampup time), and can be as large as 800 for particularly long runs.
specobjid = Object Identifier
class = object class (galaxy, star or quasar object): The class identifies an object to be either a galaxy, star or quasar. This will be the response variable which we will be trying to predict.
redshift = Final Redshift: In physics, redshift happens when light or other electromagnetic radiation from an object is increased in wavelength, or shifted to the red end of the spectrum.
plate = plate number: Each spectroscopic exposure employs a large, thin, circular metal plate that positions optical fibers via holes drilled at the locations of the images in the telescope focal plane. These fibers then feed into the spectrographs. Each plate has a unique serial number, which is called plate in views such as SpecObj in the CAS.
mjd = MJD of observation, Modified Julian Date, used to indicate the date that a given piece of SDSS data (image or spectrum) was taken.
fiberid = fiber ID. The SDSS spectrograph uses optical fibers to direct the light at the focal plane from individual objects to the slithead. Each object is assigned a corresponding fiberID.
Further information on SDSS images and their attributes:
1.2 Data Cleaning#
Basic stats about our dataset.
Task: Provide basic info for the pandas dataframe head (0.5 point)
Task: Find the data types of the dataframe columns (floats, strings, etc.) using the info() function (0.5 point).
Are there any obvious features (or elements of the dataframe) that should not impact our prediction?
objid and specobjid are just identifiers for accessing the rows back when they were stored in the original databank. Therefore we will not need them for classification, as they are not related to the outcome. The features run, rerun, camcol and field are values which describe parts of the camera at the moment of the observation, e.g. 'run' represents the corresponding scan which captured the object.
Source: http://www.sdss3.org/dr9/imaging/imaging_basics.php
Task: Drop these columns in the pandas dataframe. (1 point)
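A possible sketch, using the column names listed in the data description above:
# drop identifier and instrument columns that carry no predictive signal
sdss_df = sdss_df.drop(['objid', 'specobjid', 'run', 'rerun', 'camcol', 'field'], axis=1)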
Find out how many examples there are, how many attributes or features there are, and the data type of the class column.
Task: How many objects are in each class? (1 point)
The classes are "GALAXY", "STAR", and "QSO" (quasars). They are defined as strings, but we will convert them to integers in order to apply a loss function on the class labels during training. For this, we use the sklearn.preprocessing.LabelEncoder() function. We will do so and modify the classes in the dataframe. We should keep a copy of the original data frame to be safe.
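A minimal sketch of the label encoding (assuming the dataframe sdss_df from above):
from sklearn.preprocessing import LabelEncoder

sdss_df_str = sdss_df.copy()  # keep a copy with the original string labels
le = LabelEncoder()
sdss_df['class'] = le.fit_transform(sdss_df['class'])  # e.g. GALAXY -> 0, QSO -> 1, STAR -> 2
print(le.classes_)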
1.3 Data correlations#
Now let's find the most basic correlations among features. This can be done by applying the corr() function to the pandas dataframe. Evaluate this function and comment on which features are correlated with each other. It is convenient to use the matplotlib function matshow() for clarity. seaborn is a python module that makes really pretty statistical plots (https://seaborn.pydata.org/index.html). Install it with pip and import it.
Task: Plot the correlation matrix that can be called in the pandas dataframe. (2 points)
Hints:
Use the heatmap function and add the labels on the axes. The colormap coolwarm is nice for divergent scales like correlations that vary between -1 and 1. The argument center=0 ensures that the colormap diverges from zero. Make sure to ignore the label column "class". Remember that you can drop a column without modifying the dataframe with sdss_df.drop('class', axis=1).
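One possible sketch of the heatmap (assuming seaborn is installed and imported as sns):
import seaborn as sns

corr = sdss_df.drop('class', axis=1).corr()
plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm', center=0)
plt.title('Feature correlations (all classes)')
plt.show()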
Task: Reproduce the same plot for each of the three classes. (1 point) You can select the values from the pandas dataframe by selecting over the column ‘class’.
Task: Can you comment on groups of features that appear correlated with each other or independent from each other given these correlations, and whether there is any difference between the three celestial objects? (1 point)
1.4 Data exploration#
Given the structure of the correlations, we will explore the values of the data.
1.4.a. Distributions of redshift#
"Redshifting" happens when the source of light is moving away from the receiver: the object is moving away from Earth.
Task: plot histograms for the ‘redshift’ feature column for each class (1 point).
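A minimal sketch, assuming the encoded labels and the label encoder le from the data cleaning step:
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, name in zip(axes, le.classes_):
    code = le.transform([name])[0]
    sdss_df.loc[sdss_df['class'] == code, 'redshift'].hist(bins=50, ax=ax)
    ax.set_title(name)
    ax.set_xlabel('redshift')
plt.show()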
Task : Describe briefly the difference between the three histograms. (0.5 point)
Star:
Galaxy:
QSO:
1.4.b. Right ascension (ra) and declination (dec)#
We will now plot the right ascension versus the declination depending on the class. You can use the lmplot function in seaborn (https://seaborn.pydata.org/generated/seaborn.lmplot.html) to represent the sky view of these objects.
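A minimal sketch, assuming the copy of the dataframe with the original string labels (sdss_df_str) so the legend stays readable:
sns.lmplot(x='ra', y='dec', data=sdss_df_str, hue='class', fit_reg=False,
           palette='coolwarm', height=5, aspect=2)
plt.show()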
Task: do you see any obvious differences such that one could easily discriminate between the classes using these two coordinates? (0.5 point)
1.4.c Filters - u,g,r,i,z#
Recall: u, g, r, i, z represent the different wavelengths which are used to capture the observations. According to the correlation matrix, they are correlated for all three classes.
It is therefore interesting to see that band 'u' is less correlated with the other bands.
Task: Plot histograms and discuss why you expect these features to be correlated. (1 point)
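A minimal sketch of the five histograms:
sdss_df[['u', 'g', 'r', 'i', 'z']].hist(bins=50, figsize=(12, 8))
plt.show()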
1.5 Data Dimensionality Reduction#
At this point, we are left with 8 features: redshift, u, g, r, i, z, ra, and dec. Among these, the filters (u, g, r, i, z) are correlated to each other. There is therefore a potential for reducing the dimensions of the features using PCA on these 5 features.
We will use the sklearn function sklearn.decomposition.PCA() to fit and transform the data into the PC coordinates. Let's first explore how many PCs we need. Fit the PCA function over the total number of filters, using an array built from the selected columns of the dataframe.
Task: Perform the PCA over the maximum number of PCs, output the explained variance ratio values, and decide on an appropriate maximum number of PCs to use (6 points)
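A minimal sketch of the exploratory PCA fit:
from sklearn.decomposition import PCA

bands = sdss_df[['u', 'g', 'r', 'i', 'z']].values
pca = PCA(n_components=5)
pca.fit(bands)
print(pca.explained_variance_ratio_)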
Answer on how many PCs to use
We will now re-perform PCA with the number of PCs you found most appropriate. Re-apply the fit-transform function. Update the dataframe by adding the PCA value(s) and dropping the columns of the 5 filter features.
Task: PCA again, fit and transform, update the dataframe with the new feature(s) (3 points)
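A possible sketch, assuming (for illustration only) that three PCs are retained; adjust n_components to the number you chose above:
pca = PCA(n_components=3)
pcs = pca.fit_transform(sdss_df[['u', 'g', 'r', 'i', 'z']].values)
for k in range(pcs.shape[1]):
    sdss_df[f'PC{k + 1}'] = pcs[:, k]   # add the new PC columns
sdss_df = sdss_df.drop(['u', 'g', 'r', 'i', 'z'], axis=1)   # drop the original filter columns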
2. Unsupervised Clustering with KMeans (20 points)#
In this section, we will explore whether the data features are sufficient for classification. As a first exploration, we will perform unsupervised classification with KMeans clustering.
2.1 Perform preliminary Kmeans (10 points)#
Implement Kmeans here for a given number of clusters and on the features of interest. Choose 3 features.
Use sklearn to perform KMeans.
Repeat KMeans and discuss (in a markdown cell) the stability of clustering (e.g., use visualization to qualitatively assess the stability).
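A minimal sketch, with a hypothetical choice of three features (swap in the features you find most informative):
from sklearn.cluster import KMeans

features = sdss_df[['redshift', 'ra', 'dec']].values  # hypothetical choice of 3 features
km = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = km.fit_predict(features)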
2.2 Find the optimal number of clusters (5 points)#
Use a method to establish the optimal number of clusters.
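One common approach is the elbow method on the KMeans inertia; a minimal sketch:
inertias = []
k_values = range(2, 11)
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(features)
    inertias.append(km.inertia_)
plt.plot(k_values, inertias, 'o-')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()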
2.3 Discuss performance of clustering (5 points)#
Perform silhouette analysis (silhouette visualization and score)
Calculate (python cell) and discuss (next markdown cell) homogeneity with respect to the ground truth labels using 3 appropriate metrics.
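A minimal sketch of the scores (assuming the cluster_labels from section 2.1 and the encoded classes as ground truth; the silhouette visualization is left to you):
from sklearn.metrics import (silhouette_score, homogeneity_score,
                             completeness_score, fowlkes_mallows_score)

print('silhouette:', silhouette_score(features, cluster_labels))
print('homogeneity:', homogeneity_score(sdss_df['class'], cluster_labels))
print('completeness:', completeness_score(sdss_df['class'], cluster_labels))
print('Fowlkes-Mallows:', fowlkes_mallows_score(sdss_df['class'], cluster_labels))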
Question: After performing KMeans clustering and calculating the completeness, homogeneity, and Fowlkes-Mallows scores, how can you determine if these scores are good? Compare the obtained scores to the ideal values and explain what each score indicates about the clustering quality. What do you find from your results?
3 Machine Learning Models (30 points)#
We will now train different models on this dataset. We have a total of 8 features, 3 classes, and 10,000 samples. We will use K-Nearest Neighbors, Naive Bayes, Random Forest, Support Vector Machine, and Multi-Layer Perceptron classifiers.
We now follow a normal machine learning workflow:
Feature scaling (3 points)
Train/test set split (2 points)
Model design, training, testing (15 points)
Model comparisons, pick your winner, discuss feature importance using Random Forest. (10 points)
3.1 Feature Scaling#
Scaling all values to be within the (0, 1) interval will reduce the distortion due to exceptionally high values and make some algorithms converge faster. You can scale only the features by dropping the "class" column (without modifying the dataframe in place) using the pandas function drop().
Task: Scale just the features (3 points)
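A minimal sketch using MinMaxScaler, which maps each feature to the (0, 1) interval:
from sklearn.preprocessing import MinMaxScaler

X = sdss_df.drop('class', axis=1)   # features only; sdss_df itself is unchanged
y = sdss_df['class']
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)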
3.2 Test, train, validation data sets.#
Task: Split the data into a training and a test part. (2 points)
The models will be trained on the training data set and tested on the test data set.
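A minimal sketch (the test fraction and random seed are arbitrary choices here):
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.33, random_state=42)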
Computation time is important to account for when scaling up the data set and the model size. You can evaluate the relative computational time using the function time.perf_counter(), which returns an absolute time stamp; compare the computational cost by taking the difference between two time stamps:
import time
t1 = time.perf_counter()   # time stamp before the computation
# ... model training / prediction goes here ...
t2 = time.perf_counter()   # time stamp after the computation
tcomp = t2 - t1            # elapsed time in seconds
We will also assess the model performance of these multi-class classifiers. We will evaluate the average of the scores over the 3 class labels.
In the following, we will be testing over several classifiers. Follow the steps:
model definition/design
training
prediction on test
evaluation: a) print the classification_report; b) save the precision, recall, fscore and accuracy in variables
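A hypothetical helper that wraps these steps (the function name and the returned tuple are assumptions, not part of the assignment):
import time
from sklearn.metrics import classification_report, precision_recall_fscore_support, accuracy_score

def evaluate(model, name):
    # fit on the training set, predict on the test set, and report averaged scores
    t1 = time.perf_counter()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    tcomp = time.perf_counter() - t1
    print(name)
    print(classification_report(y_test, y_pred, target_names=le.classes_))
    precision, recall, fscore, _ = precision_recall_fscore_support(y_test, y_pred, average='macro')
    accuracy = accuracy_score(y_test, y_pred)
    return precision, recall, fscore, accuracy, tcomp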
3.3.a K Nearest Neighbors (3 points)#
Check out the function arguments and definition here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
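A minimal sketch, using the hypothetical evaluate helper defined above (or the equivalent explicit steps):
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)   # n_neighbors is a starting point, not a tuned value
knn_scores = evaluate(knn, 'K Nearest Neighbors')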
3.3.b Naive Bayes (3 points)#
Check out the sklearn tutorial pages here: https://scikit-learn.org/stable/modules/naive_bayes.html#naive-bayes. We propose to use the Gaussian Naive Bayes.
Gaussian Naive Bayes assumes the features to be normally distributed; here we will approximate this by rescaling with the MaxAbsScaler. For this example, then, we will use the unscaled data and then rescale it.
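A minimal sketch (applied to the same scaled training set for simplicity; the MaxAbsScaler rescaling of the raw features suggested above is an alternative):
from sklearn.naive_bayes import GaussianNB

gnb = GaussianNB()
gnb_scores = evaluate(gnb, 'Gaussian Naive Bayes')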
3.3.c Random Forest Classifier (2 points)#
Check out the tutorial page here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
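A minimal sketch (the classifier name rfc matches the feature-importance section below):
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=0)
rfc_scores = evaluate(rfc, 'Random Forest')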
3.3.d Support Vector Machine Classifier (2 points)#
Check out the sklearn information page here: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
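A minimal sketch:
from sklearn.svm import SVC

svc = SVC(kernel='rbf')   # the default RBF kernel; try others if you like
svc_scores = evaluate(svc, 'Support Vector Machine')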
3.3.e Multi-Layer Perceptron (3 points)#
Check out the information page here: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
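A minimal sketch (the hidden-layer size and iteration cap are arbitrary starting points):
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0)
mlp_scores = evaluate(mlp, 'Multi-Layer Perceptron')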
3.4 Model performance and comparison#
3.4.a Confusion Matrix and interpretation#
Task: Plot the confusion matrix (2 points)
Use ConfusionMatrixDisplay from sklearn to visualize the confusion matrix.
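A minimal sketch for one classifier (repeat or loop over the others):
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(rfc, X_test, y_test,
                                      display_labels=le.classes_, cmap='Blues')
plt.show()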
Task: Comment on which classifier appears most likely to be the best (1 point). You can also comment on the misclassification and confusion rates.
3.4.b K-Fold Cross Validation#
We will now perform k-fold cross validation for the classifiers. We use the function cross_val_score on each estimator, on the training set, with 10 folds, and use accuracy as the score metric.
Task: perform the cross validation over K folds, output the mean and standard deviation of the accuracy (3 points)
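A minimal sketch, looping over the estimators defined in section 3.3:
from sklearn.model_selection import cross_val_score

for name, model in [('KNN', knn), ('Naive Bayes', gnb), ('Random Forest', rfc),
                    ('SVM', svc), ('MLP', mlp)]:
    scores = cross_val_score(model, X_train, y_train, cv=10, scoring='accuracy')
    print(f'{name}: mean accuracy = {scores.mean():.3f}, std = {scores.std():.3f}')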
Task: Which method won the cross-validation test? (1 point)
see the cell below
3.4.c And the winner is …#
Let’s compare the results. Task: Create a pandas dataframe with all of the performance metrics, including the results from K-fold cross validation. (2 points)
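A possible sketch, assuming the score tuples returned by the hypothetical evaluate helper above:
results = pd.DataFrame(
    [knn_scores, gnb_scores, rfc_scores, svc_scores, mlp_scores],
    index=['KNN', 'Naive Bayes', 'Random Forest', 'SVM', 'MLP'],
    columns=['precision', 'recall', 'fscore', 'accuracy', 'time (s)'])
# add a column with the cross-validation mean accuracies computed above
results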
Task: Comment on the accuracy and performance and choose a winner. (1 point)
see the cell below
4 Summary (4 points)#
4.1 Feature Importance using Random Forest Classifier#
Decision Trees have the unique property of being able to order features by their ability to split between the classes. If some features dominate over others in the predictive power of the classes, one can further reduce the dimension of the features for additional analysis. The vector of feature importances is the attribute rfc.feature_importances_; sort it in ascending order of importance and store the sorted vector.
Task: plot a bar plot using the function matplotlib.pyplot.bar. (2 points)
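A minimal sketch (X is the feature dataframe from the scaling step, so X.columns gives the feature names):
importances = rfc.feature_importances_
order = np.argsort(importances)               # indices sorted by ascending importance
plt.bar(np.array(X.columns)[order], importances[order])
plt.xticks(rotation=45)
plt.ylabel('feature importance')
plt.show()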
Task: What are the top three features (1 point)?
enter in the cell below
In this notebook, you have learned that redshift was the best predictor of what object you are observing. Now, did you actually need to do all this to find that out? Probably not if you were an astrophysicist! But hey, we are not. So great job!
Task: Briefly comment on what you have learned (1 point)
see the cell below