Wednesday, 3 February 2016

Why not approach classification through linear regression?


Classification and Logistic Regression

As Andrew Ng explains it, with linear regression you fit a polynomial through the data. In the example below, we're fitting a straight line through a sample set of {tumor size, tumor type} pairs:
[Figure: straight line h(x) fitted to tumor type (0 or 1) plotted against tumor size]
Above, malignant tumors get 1 and benign ones get 0, and the green line is our hypothesis h(x). To make predictions, we can say that for any given tumor size x, if h(x) is greater than 0.5 we predict a malignant tumor; otherwise we predict a benign one.
It looks like this way we can correctly predict every single training set sample. But now let's change the task a bit.
Intuitively, it's clear that all tumors larger than a certain threshold are malignant. So let's add another sample with a huge tumor size, and run linear regression again:
[Figure: the same fit after adding one sample with a huge tumor size]
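Now our rule "h(x) > 0.5 means malignant" doesn't work anymore. The new sample pulls the fitted line flatter, so h(x) crosses 0.5 at a larger tumor size and some malignant tumors now fall below it. To keep making correct predictions we would need to change the rule to h(x) > 0.2 or something, but that's not how the algorithm should work.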
We cannot change the hypothesis each time a new sample arrives. Instead, we should learn it from the training set data, and then (using the hypothesis we've learned) make correct predictions for data we haven't seen before.
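To see the effect in numbers, here is a minimal sketch in Python with NumPy. The tumor sizes and labels are made up for illustration (they are not the data from the plots above), and the helper names `fit_line`, `accuracy` and `crossing` are my own. It fits a least-squares line, applies the h(x) > 0.5 rule, then adds one huge malignant tumor and refits.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares straight line h(x) = slope * x + intercept."""
    slope, intercept = np.polyfit(x, y, 1)
    return slope, intercept

def accuracy(x, y, slope, intercept, threshold=0.5):
    """Fraction of samples where the rule 'h(x) > threshold' matches the label."""
    predicted_malignant = slope * x + intercept > threshold
    return np.mean(predicted_malignant == (y == 1))

def crossing(slope, intercept, threshold=0.5):
    """Tumor size at which h(x) equals the threshold."""
    return (threshold - intercept) / slope

# Hypothetical data: sizes 1..4 are benign (0), sizes 5..8 are malignant (1).
sizes = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

slope, intercept = fit_line(sizes, labels)
acc_before = accuracy(sizes, labels, slope, intercept)
boundary_before = crossing(slope, intercept)

# Add one malignant sample with a huge tumor size and run the regression again.
sizes_big = np.append(sizes, 40.0)
labels_big = np.append(labels, 1)

slope_big, intercept_big = fit_line(sizes_big, labels_big)
acc_after = accuracy(sizes_big, labels_big, slope_big, intercept_big)
boundary_after = crossing(slope_big, intercept_big)

print("before:", acc_before, "h(x) = 0.5 at size", boundary_before)
print("after: ", acc_after, "h(x) = 0.5 at size", boundary_after)
```

With this toy data the point where h(x) reaches 0.5 moves from 4.5 to roughly 5.7, so the malignant tumor of size 5 ends up on the wrong side of the rule even though nothing about it changed.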
Hope this explains why linear regression is not the best fit for classification problems! Also, you might want to watch the "VI. Logistic Regression. Classification" video on ml-class.org, which explains the idea in more detail.
