3.2 Classification and Regression#

Problems that require a quantitative response (a numeric value) are regression problems; problems that require a qualitative response (a boolean or a category) are classification problems. Many statistical methods can be applied to both types of problems.

Binary classification has two output classes, often framed as “A” and “not A”, for example “earthquake” versus “no earthquake” (noise). Multiclass classification refers to problems with more than two classes.

Classification here requires that we know the labels; it is therefore a form of supervised learning.

1. Classification Algorithms#

There are several classification algorithms, which we summarize below before practicing.

  • Logistic Regression: Models the probability of class membership by applying the logistic (sigmoid) function to a linear combination of the features; the class with the highest probability is predicted.

  • Linear Discriminant Analysis (LDA): The LDA optimization produces an optimal dimensionality reduction onto a decision line for classification. It maximizes the separation between classes relative to the within-class variance and is analogous to finding a PCA-like coordinate system.

  • Stochastic Gradient Descent (SGD): Fits a linear classifier by updating the model parameters one sample (or small batch) at a time, which scales well to large datasets.

  • Naive Bayes (NB): A simple algorithm that requires few hyperparameters and provides interpretable results. It computes the conditional probability of each feature given the class, assumes the features are independent, and assigns the class that maximizes the product of these probabilities.

  • K-nearest neighbors (KNN): Choose K, the number of nearest data points to consider. For each sample, gather its K nearest neighbors and assign the class that is most represented in that group (the mode of the K labels).

  • Support Vector Machine (SVM): Finds the hyperplane that separates the classes with the largest margin. The decision boundary can be linear or more complex through kernel SVMs (e.g., radial basis function and polynomial kernels). SVMs were very popular when training data was limited.

  • Random Forest (RF): Decision trees are common in prediction pipelines; decision tree learning builds a predictive model by recursively splitting the data on feature values, and a random forest combines many such trees into an ensemble. More on that this Monday.

Some classifiers handle multiclass problems natively (Stochastic Gradient Descent - SGD; Random Forest classification; Naive Bayes). Others are inherently binary classifiers (Logistic Regression, Support Vector Machine classifier - SVM) and are typically extended to multiclass problems with one-vs-rest or one-vs-one strategies. A minimal comparison of these classifiers is sketched below.
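The following sketch compares the classifiers listed above using scikit-learn. The synthetic dataset, hyperparameter choices, and accuracy scores are illustrative assumptions, not results from this course's data.

```python
# Minimal sketch: fit several scikit-learn classifiers on a synthetic binary dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic two-class data (e.g., "earthquake" vs "noise"); purely illustrative
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

classifiers = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "LDA": LinearDiscriminantAnalysis(),
    "SGD": SGDClassifier(),
    "Naive Bayes": GaussianNB(),
    "KNN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=100),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)                              # train on the training split
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.2f}")
```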

2. Regression Algorithms#

2.1 Linear Regression#

Let \(y\) be the data, and \(\hat{y}\) be the predicted value of the data. A general linear regression can be formulated as

\(\hat{y} = w_0 + w_1 x_1 + ... + w_n x_n = h_w (\mathbf{x})\).

In matrix form, \(\mathbf{\hat{y}} = \mathbf{G} \mathbf{w}\), where \(\mathbf{G}\) is the \(m \times (n+1)\) design matrix whose rows contain a 1 (for the intercept \(w_0\)) followed by the \(n\) feature values of each data sample.

\(\mathbf{y}\) is a data vector of length \(m\), \(\mathbf{x}\) is a feature vector of length \(n\), and \(\mathbf{w}\) is a vector of model parameters. \(h_w\) is referred to as the hypothesis function, i.e., the model using the model parameters \(\mathbf{w}\). In the simplest case of a linear regression against time, the formulation becomes:

\(\hat{y} = w_0 + w_1 t\),

where \(x_1 = t\) is the time feature.
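As a small sketch, the design matrix \(\mathbf{G}\) for this time-linear model can be built with NumPy. The time vector, the "true" parameters, and the noise level below are hypothetical values chosen for illustration.

```python
import numpy as np

# Hypothetical time feature and synthetic data for the model y = w0 + w1 * t
t = np.linspace(0, 10, 50)                      # time feature x1 = t
w0_true, w1_true = 2.0, 0.5                     # assumed "true" model parameters
y = w0_true + w1_true * t + 0.3 * np.random.randn(t.size)

# Design matrix G: a column of ones (for the intercept w0) and the time feature
G = np.column_stack([np.ones_like(t), t])
print(G.shape)                                  # (m, 2): m data points, 2 model parameters
```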

To evaluate how well the model performs, we compute a loss score, or residual: the result of applying a loss (also called cost or objective) function to the predictions and the data. The most basic cost function is the Mean Squared Error (MSE):

\(MSE(\mathbf{x},h_w) = \frac{1}{m} \sum_{i=1}^{m} \left( h_w(\mathbf{x})_i - y_i \right)^2 = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}_i - y_i \right)^2 \), in the case of a linear regression.
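A minimal sketch of the MSE computation, assuming synthetic data and hypothetical predicted parameters; the manual formula is compared against scikit-learn's `mean_squared_error` helper.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical data and predictions for the time-linear model (illustrative values)
t = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * t + 0.3 * np.random.randn(t.size)     # observed data
y_hat = 2.1 + 0.48 * t                                # model predictions

# MSE written out explicitly, and with scikit-learn for comparison
mse_manual = np.mean((y_hat - y) ** 2)
mse_sklearn = mean_squared_error(y, y_hat)
print(mse_manual, mse_sklearn)                        # the two values agree
```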

The Normal Equation gives the linear-regression solution that minimizes the MSE:

\(\mathbf{w} = \left( \mathbf{G}^T\mathbf{G} \right)^{-1} \mathbf{G}^T \mathbf{y}\)

This compares with the classic inverse problem framed by \(\mathbf{d} = \mathbf{G} \mathbf{m}\).

\(\mathbf{m} = \left( \mathbf{G}^T\mathbf{G} \right)^{-1} \mathbf{G}^T \mathbf{d} \)

It can be solved using NumPy's linear algebra module. If \(\left( \mathbf{G}^T\mathbf{G} \right)\) is singular and cannot be inverted, the pseudoinverse, computed with the singular value decomposition, can be used instead and handles rank-deficient matrices. In a previous class we also used the Scikit-learn class sklearn.linear_model.LinearRegression, which solves the same least-squares problem internally. We practice below how to use these standard inversions:
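The sketch below illustrates the three approaches under the same assumptions as the earlier examples (synthetic time-linear data): the Normal Equation with `np.linalg.inv`, the pseudoinverse with `np.linalg.pinv`, and scikit-learn's `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for y = w0 + w1 * t (illustrative values)
t = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * t + 0.3 * np.random.randn(t.size)
G = np.column_stack([np.ones_like(t), t])              # design matrix

# 1) Normal Equation: w = (G^T G)^{-1} G^T y
w_normal = np.linalg.inv(G.T @ G) @ G.T @ y

# 2) Pseudoinverse (SVD-based, works even when G^T G is singular)
w_pinv = np.linalg.pinv(G) @ y

# 3) scikit-learn least-squares fit (intercept column is already in G)
reg = LinearRegression(fit_intercept=False).fit(G, y)
w_sklearn = reg.coef_

print(w_normal, w_pinv, w_sklearn)                     # all three should agree closely
```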