
Statistical Learning 4

Arjun Jayaprakash

2019/09/30

Classification

The response here is not quantitative but qualitative. Qualitative variables take values in an unordered set $\mathcal{C}$, such as

$$\text{eye color} \in \{\text{brown}, \text{green}, \text{blue}\}, \qquad \text{email} \in \{\text{spam}, \text{ham}\}$$

Given a feature vector $X$ and a qualitative response $Y$ taking values in the set $\mathcal{C}$, the classification task is to build a function $C(X)$ that takes as input the feature vector $X$ and predicts its value for $Y$, i.e., $C(X) \in \mathcal{C}$. Often we are more interested in estimating the probabilities that $X$ belongs to each category in $\mathcal{C}$.

Logistic Regression

Logistic regression is used to predict the probability of observing one of two mutually exclusive outcomes based on a given input variable. Let’s write $p(X) = P(Y = 1 \mid X)$ for short and consider using one variable to predict the probability. Logistic regression uses the form:

$$p(X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

The value of $p(X)$ always lies between 0 and 1. This is basically a transformation of a linear model to guarantee that we get a probability. This is done by the logit transformation:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X$$
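
As a quick numerical check (the coefficients $\beta_0 = -3$, $\beta_1 = 0.05$ and the input $x = 40$ below are made up), the logistic form and R's built-in inverse logit give the same probability:

```r
# Made-up coefficients and input, just to check that the two forms agree
beta0 <- -3
beta1 <- 0.05
x     <- 40
eta   <- beta0 + beta1 * x        # the linear predictor beta_0 + beta_1 * x

exp(eta) / (1 + exp(eta))         # p(X) written out from the logistic form
plogis(eta)                       # same value via R's inverse-logit function
```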

We can use maximum likelihood to estimate this model from given data.

$$L(\beta_0, \beta_1) = \prod_{i: y_i = 1} p(x_i) \prod_{i: y_i = 0} \left(1 - p(x_i)\right)$$

This likelihood $L$ gives the probability of the observed zeros and ones in the data. We pick $\beta_0$ and $\beta_1$ to maximize the likelihood of the observed data.
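
As a sketch of what is being maximized (the data and coefficients below are made up), the log of the likelihood above can be written out directly:

```r
# Made-up binary data and coefficients, purely for illustration
x <- c(10, 25, 40, 55)
y <- c(0, 0, 1, 1)
beta0 <- -3
beta1 <- 0.1

p <- plogis(beta0 + beta1 * x)           # p(x_i) for each observation
sum(y * log(p) + (1 - y) * log(1 - p))   # log L(beta_0, beta_1)
```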

Most statistical packages can fit linear logistic regression models by maximum likelihood. In R we can use the glm function. This would give us the $\hat{\beta}_0$ and $\hat{\beta}_1$ that best fit our data. Once we have the parameters, we can predict the probability of observing the response for any observed predictor.
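
A minimal sketch of this in R, assuming a hypothetical data frame df with a binary response default and a single predictor balance:

```r
# Fit a simple logistic regression by maximum likelihood;
# `df`, `default`, and `balance` are hypothetical
fit <- glm(default ~ balance, data = df, family = binomial)
summary(fit)$coefficients   # estimates of beta_0 and beta_1

# Predicted probability p(X) at a new value of the predictor
predict(fit, newdata = data.frame(balance = 1000), type = "response")
```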

Multivariate Logistic Regression

Logistic regression models with multiple predictor variables look like:

$$\log\left(\frac{p(X)}{1 - p(X)}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p$$
$$p(X) = \frac{e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}{1 + e^{\beta_0 + \beta_1 X_1 + \dots + \beta_p X_p}}$$

What one has to be careful about when doing regression with multiple variables is the confounding effect of correlated predictors. Remember that no causality can be claimed for observational data; causality can only be inferred if we set up an experiment to investigate it.
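
A sketch of a multiple-predictor fit, again with a hypothetical data frame df; comparing a coefficient here with its single-predictor counterpart is one simple way to spot confounding:

```r
# Multiple-predictor logistic regression; `df` and the predictors are hypothetical
fit_multi <- glm(default ~ balance + income + student,
                 data = df, family = binomial)
summary(fit_multi)$coefficients

# If the sign or size of the `student` coefficient differs noticeably from
# the single-predictor fit below, correlated predictors are confounding it
fit_single <- glm(default ~ student, data = df, family = binomial)
summary(fit_single)$coefficients
```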

Logistic regression with more than two classes as response

So far we have discussed logistic regression with two classes. It is easily generalized to more than two classes. One version (used in the R package glmnet) has the symmetric form

P(Y=k|X)=eβ0k+β1kX1+...+βpkXpKl=1eβ0l+β1lX1+...+βplXp

This is also referred to as multinomial regression.
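
A sketch of how this might be fit with glmnet, assuming a numeric predictor matrix x, a factor response y with K > 2 levels, and a new-data matrix x_new (all hypothetical):

```r
library(glmnet)

# Multinomial logistic regression; `x`, `y`, and `x_new` are assumed to exist
fit_mn <- glmnet(x, y, family = "multinomial")

# Class probabilities P(Y = k | X) for the new observations, at a chosen
# value of the regularization parameter lambda
predict(fit_mn, newx = x_new, s = 0.01, type = "response")
```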

Discriminant Analysis

This is a different approach to classification. Here, the approach is to model the distribution of X in each of the classes separately, and then use Bayes theorem to flip things around and obtain P(Y|X). When we use normal distributions for each class, this leads to linear or quadratic discriminant analysis. However, this approach is quite general, and other distributions can be used as well.

Bayes theorem:

$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k) \, P(Y = k)}{P(X = x)}$$

One writes this slightly differently for discriminant analysis:

$$P(Y = k \mid X = x) = \frac{\pi_k f_k(x)}{\sum_{l=1}^{K} \pi_l f_l(x)}$$

where $f_k(x) = P(X = x \mid Y = k)$ is the density for $X$ in class $k$. In other words, this is the likelihood of observing $X = x$ assuming $Y = k$ is the true result. Also, $\pi_k = P(Y = k)$ is the marginal or prior probability for class $k$.
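
A tiny worked example of this formula, with two Gaussian classes whose means, standard deviations, and priors are made up:

```r
# Posterior probabilities from Bayes theorem for a single value x,
# with K = 2 made-up Gaussian classes
x      <- 1.5
priors <- c(0.3, 0.7)                    # pi_k
dens   <- c(dnorm(x, mean = 0, sd = 1),  # f_1(x)
            dnorm(x, mean = 2, sd = 1))  # f_2(x)

priors * dens / sum(priors * dens)       # P(Y = k | X = x), k = 1, 2
```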

Why use discriminant analysis?

  • When the classes are well-separated, the parameter estimates for the logistic regression model are surprisingly unstable. Linear discriminant analysis does not suffer from this problem.

  • If n is small and the distribution of the predictors X is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.

  • Linear discriminant analysis is popular when we have more than two response classes, because it also provides low-dimensional views of the data.

Linear Discriminant Analysis when p = 1

The density function for the Gaussian (normal) distribution is:

$$f_k(x) = \frac{1}{\sqrt{2\pi}\,\sigma_k} e^{-\frac{1}{2}\left(\frac{x - \mu_k}{\sigma_k}\right)^2}$$
where $\mu_k$ is the mean and $\sigma_k^2$ is the variance in class $k$. The assumption here is that all the $\sigma_k = \sigma$ are the same.

When we plug this into Bayes theorem we get a formula for $p_k(x)$. To classify at the value $X = x$, we need to see which of the $p_k(x)$ is largest. Taking logs, and discarding terms that do not depend on $k$, we see that this is equivalent to assigning $x$ to the class with the largest discriminant score.

$$\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$$

This $\delta_k(x)$ is a linear function of $x$.
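
A small sketch of this score in R, with made-up class means, a shared sigma, and priors standing in for the plug-in estimates one would compute from data:

```r
# Discriminant scores for p = 1 LDA; all parameter values are made up
mu     <- c(0, 2)        # class means mu_k
sigma  <- 1              # common standard deviation
priors <- c(0.5, 0.5)    # prior probabilities pi_k

delta <- function(x, k) {
  x * mu[k] / sigma^2 - mu[k]^2 / (2 * sigma^2) + log(priors[k])
}

x <- 1.2
scores <- sapply(seq_along(mu), function(k) delta(x, k))
scores
which.max(scores)        # the class assigned to x
```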

When there is more than one predictor, $p > 1$, the density functions involved in discriminant analysis are joint densities. The math gets more complicated here because covariance matrices enter the expressions for the discriminant score.

What is great is that the discriminant scores can be turned into probabilities quite easily:

$$\hat{P}(Y = k \mid X = x) = \frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K} e^{\hat{\delta}_l(x)}}$$

So classifying to the largest score amounts to classifying to the class for which the probability is the largest.
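
The conversion is just a softmax over the scores, and in practice the MASS package's lda function returns these posterior probabilities directly. In the sketch below the score values, and the data frame df with response y and predictor x1, are hypothetical:

```r
# Softmax conversion of made-up discriminant scores into probabilities
scores <- c(-0.72, 0.58)             # hypothetical delta_k(x) values
exp(scores) / sum(exp(scores))       # P(Y = k | X = x)

# The same kind of posterior comes out of MASS::lda;
# `df`, `y`, and `x1` are hypothetical
library(MASS)
lda_fit <- lda(y ~ x1, data = df)
predict(lda_fit, newdata = data.frame(x1 = 1.2))$posterior
```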