Simple binary logistic regression using MATLAB
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1). If your data were two-dimensional, this would mean you could draw a line through your two-dimensional space X such that all instances of a particular class fall on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression tries to fit a function of the following form:

y = 1 / (1 + e^(-(b0 + b1*x)))

This returns y = 0 or y = 1 only in the limit, when the expression inside the exponential, b0 + b1*x, goes to negative or positive infinity.
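To see that saturation concretely, here is a quick sketch in Python (the logistic function written out by hand; nothing here depends on MATLAB):

```python
import math

def sigmoid(z):
    # Logistic function: 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))

# For any finite argument the output is strictly between 0 and 1;
# it reaches exactly 0 or 1 only in the limit z -> -inf or +inf.
for z in (-20.0, -1.0, 0.0, 1.0, 20.0):
    p = sigmoid(z)
    assert 0.0 < p < 1.0

print(sigmoid(0.0))  # 0.5
```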
Now, because your data is linearly separable and MATLAB's LR function attempts to find a maximum-likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the label on just one of your data points (so for some index t where y(t) == 0, set y(t) = 1). This will cause your data to no longer be linearly separable, and the learned weight values will be dragged dramatically closer to zero.
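The effect is easy to reproduce. Below is a small Python sketch (hand-rolled gradient ascent, not MATLAB's solver) that fits a 1-D logistic model to perfectly separable data and to the same data with one label flipped; the data values are made up for illustration:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.5, steps=50_000):
    # Plain gradient ascent on the log-likelihood of y ~ sigmoid(b0 + b1*x).
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # gradient of log-lik w.r.t. z
            g0 += err
            g1 += err * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

xs      = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys_sep  = [0, 0, 0, 0, 1, 1, 1, 1]   # perfectly separable at x = 0.5
ys_flip = [1, 0, 0, 0, 1, 1, 1, 1]   # same data with one label flipped

_, b1_sep  = fit_logistic(xs, ys_sep)
_, b1_flip = fit_logistic(xs, ys_flip)

# On separable data the slope just keeps growing with more iterations;
# with one flipped label it settles at a much smaller finite value.
print(b1_sep, b1_flip)
```

The separable fit never actually converges: the likelihood keeps improving as the weights grow, which is exactly why glmfit hands back huge coefficients and huge standard errors.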
chex
Updated on July 05, 2022

Comments

- chex, almost 2 years ago:
I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:

[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.

I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I created an array for the fitting (X_fit = linspace(0,1)), extracted the coefficients, and used glmval to create estimates (Y_fit = glmval(b,X_fit,'logit')). When I overlaid the plots of the original data and the model using

figure, plot(X,Y,'o',X_fit,Y_fit,'-')

the resulting plot of the model essentially looked like the lower 1/4th of the 'S'-shaped curve that is typical of logistic regression plots.

My questions are as follows:
1) Why did my use of glmfit give strange results?

2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?

3) How do I get confidence intervals for my model parameters? glmval should be able to take the stats output from glmfit, but my use of glmfit is not giving correct results.

Any comments and input would be very useful, thanks!
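On question 3, one common route (whichever fitting function is used) is a Wald interval: estimate ± z × standard error, built from the beta and se fields of the stats structure that glmfit returns. A sketch of that arithmetic in Python; the coefficient and standard-error numbers below are made-up placeholders, not output from a real fit:

```python
# Wald 95% confidence intervals: estimate +/- 1.96 * standard error.
# Placeholder values standing in for stats.beta and stats.se from glmfit.
beta = [-3.2, 7.1]   # hypothetical intercept and slope estimates
se   = [0.9, 1.8]    # hypothetical standard errors
z = 1.96             # approx. 97.5th percentile of the standard normal

intervals = [(b - z * s, b + z * s) for b, s in zip(beta, se)]
for b, (lo, hi) in zip(beta, intervals):
    print(f"estimate {b:6.2f}  95% CI [{lo:7.3f}, {hi:7.3f}]")
```

Note that with a degenerate separable fit, the enormous standard errors make these intervals meaninglessly wide, which is another symptom of the separability problem.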
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use

[b_fit,dev,stats] = mnrfit(X,Y+1);

where Y+1 simply makes my binary classifier into a nominal one. I can then loop through

[pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats);

to get various pihat probability values, where loopVal = linspace(0,1) (or some appropriate input range) and ii = 1:length(loopVal).

The stats parameter has a great correlation coefficient (0.9973), but the p-values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work where glmfit didn't in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p << 0.001, and the coefficient estimates were quite different as well.

Finally, how does one interpret the dev output from the mnrfit function? The MATLAB documentation states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is it only meaningful when compared to dev values from other models?
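On the dev question: for 0/1 responses the deviance reduces to -2 times the fitted model's log-likelihood (the saturated model's log-likelihood is 0 in that case), and it is mostly useful for comparing nested models rather than as a stand-alone number. A small Python sketch with made-up fitted probabilities:

```python
import math

def binomial_deviance(y, p):
    # Deviance for binary 0/1 outcomes: -2 * log-likelihood of the fit.
    # (The saturated model contributes 0 here, so no extra term.)
    ll = sum(yi * math.log(pi) + (1 - yi) * math.log(1.0 - pi)
             for yi, pi in zip(y, p))
    return -2.0 * ll

# Hypothetical labels and two hypothetical sets of fitted probabilities.
y       = [0, 0, 1, 1]
p_good  = [0.1, 0.2, 0.8, 0.9]   # close to the labels -> small deviance
p_rough = [0.4, 0.5, 0.6, 0.5]   # near chance        -> larger deviance

print(binomial_deviance(y, p_good))
print(binomial_deviance(y, p_rough))
```

The drop in deviance between two nested models is what feeds a likelihood-ratio (chi-squared) test; a single deviance value by itself is hard to interpret.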