Gradient descent and normal equation method for solving linear regression gives different solutions
Solution 1
As others mentioned, an ill-conditioned hessian matrix is likely the cause of your problem.
The number of steps that standard gradient descent algorithms take to reach a local optimum is a function of the largest eigenvalue of the hessian divided by the smallest (this is known as the condition number of the Hessian). So, if your matrix is ill-conditioned, then it could take an extremely large number of iterations for gradient descent to converge to an optimum. (For the singular case, it could converge to many points, of course.)
I would suggest trying three different things to verify that an unconstrained optimization algorithm works for your problem (which it should): 1) Generate some synthetic data by computing the result of a known linear function for random inputs and adding a small amount of gaussian noise. Make sure that you have many more data points than dimensions. This should produce a non-singular hessian. 2) Add a regularization terms to your error function to increase the condition number of the hessian. 3) Use a second order method like conjugate gradient or L-BFGS rather than gradient descent to reduce the number of steps needed for the algorithm to converge. (You will probably need to do this in conjunction with #2).
Solution 2
Could you post a little more about what you X looks like? You're using pinv() which is Moore-Penrose pseudo inverse. If the matrix is ill-conditioned this could cause problems with obtaining the inverse. I would bet that the gradient-descent method is closer to the mark.
Rasto
Entrepreneur, UX consultant, full-stack developer with strongest competency in Swift & iOS. Can get die-hard about top-notch UX and code quality. Diving deeper into React and getting excited about GraphQL. My pro history chapters are written in C#, Objective-C, Java, JS, TS, Flow, even Haskel, Prolog & Pascal. Sports and outdoor enthusiast. Love exploring cultures around the world. Found in mountains in the winter and at seaside during summer. Amater photographer. Always happy to learn and share what I learned e.g. externally giving lectures at my alma mater.
Updated on June 19, 2022Comments
-
Rasto almost 2 years
I'm working on machine learning problem and want to use linear regression as learning algorithm. I have implemented 2 different methods to find parameters
theta
of linear regression model: Gradient (steepest) descent and Normal equation. On the same data they should both give approximately equaltheta
vector. However they do not.Both
theta
vectors are very similar on all elements but the first one. That is the one used to multiply vector of all 1 added to the data.Here is how the
theta
s look like (fist column is output of Gradient descent, second output of Normal equation):Grad desc Norm eq -237.7752 -4.6736 -5.8471 -5.8467 9.9174 9.9178 2.1135 2.1134 -1.5001 -1.5003 -37.8558 -37.8505 -1.1024 -1.1116 -19.2969 -19.2956 66.6423 66.6447 297.3666 296.7604 -741.9281 -744.1541 296.4649 296.3494 146.0304 144.4158 -2.9978 -2.9976 -0.8190 -0.8189
What can cause the difference in
theta(1, 1)
returned by gradient descent compared totheta(1, 1)
returned by normal equation? Do I have bug in my code?Here is my implementation of normal equation in Matlab:
function theta = normalEque(X, y) [m, n] = size(X); X = [ones(m, 1), X]; theta = pinv(X'*X)*X'*y; end
Here is code for gradient descent:
function theta = gradientDesc(X, y) options = optimset('GradObj', 'on', 'MaxIter', 9999); [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),... zeros(size(X, 2), 1), options); end function [J, grad] = cost(theta, X, y) m = size(X, 1); X = [ones(m, 1), X]; J = sum((X * theta - y) .^ 2) ./ (2*m); for i = 1:size(theta, 1) grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m; end end
I pass exactly the same data
X
andy
to both functions (I do not normalizeX
).Edit 1:
Based on answers and comments I checked few my code and run some tests.
First I want to check if the problem can be cause by X beeing near singular as suggested by @user1489497's answer. So I replaced pinv by inv - and when run it I really got warning
Matrix is close to singular or badly scaled.
. To be sure that that is not the problem I obtained much larger dataset and run tests with this new dataset. This timeinv(X)
did not display the warning and usingpinv
andinv
gave same results. So I hope thatX
is not close to singular any more.Then I changed
normalEque
code as suggested by woodchips so now it looks like:function theta = normalEque(X, y) X = [ones(size(X, 1), 1), X]; theta = pinv(X)*y; end
However the problem is still there. New
normalEque
function on new data that are not close to singular gives differenttheta
asgradientDesc
.To find out which algorithm is buggy I have run linear regression algorithm of data mining software Weka on the same data. Weka computed theta very similar to output of
normalEque
but different to the output ofgradientDesc
. So I guess thatnormalEque
is correct and there is a bug ingradientDesc
.Here is comparison of
theta
s computed by Weka,normalEque
andGradientDesc
:Weka(correct) normalEque gradientDesc 779.8229 779.8163 302.7994 1.6571 1.6571 1.7064 1.8430 1.8431 2.3809 -1.5945 -1.5945 -1.5964 3.8190 3.8195 5.7486 -4.8265 -4.8284 -11.1071 -6.9000 -6.9006 -11.8924 -15.6956 -15.6958 -13.5411 43.5561 43.5571 31.5036 -44.5380 -44.5386 -26.5137 0.9935 0.9926 1.2153 -3.1556 -3.1576 -1.8517 -0.1927 -0.1919 -0.6583 2.9207 2.9227 1.5632 1.1713 1.1710 1.1622 0.1091 0.1093 0.0084 1.5768 1.5762 1.6318 -1.3968 -1.3958 -2.1131 0.6966 0.6963 0.5630 0.1990 0.1990 -0.2521 0.4624 0.4624 0.2921 -12.6013 -12.6014 -12.2014 -0.1328 -0.1328 -0.1359
I also computed errors as suggested by Justin Peel's answer. Output of
normalEque
gives slightly lesser squared error but the difference is marginal. What is more when I compute gradient of cost oftheta
using functioncost
(the same as the one used bygradientDesc
) I got gradient near zero. Same done on output ofgradientDesc
does not give gradient near zero. Here is what I mean:>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1)); >> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1)); >> disp([J_gd, J_ne]) 120.9932 119.1469 >> disp([grad_gd, grad_ne]) -0.005172856743846 -0.000000000908598 -0.026126463200876 -0.000000135414602 -0.008365136595272 -0.000000140327001 -0.094516503056041 -0.000000169627717 -0.028805977931093 -0.000000045136985 -0.004761477661464 -0.000000005065103 -0.007389474786628 -0.000000005010731 0.065544198835505 -0.000000046847073 0.044205371015018 -0.000000046169012 0.089237705611538 -0.000000046081288 -0.042549228192766 -0.000000051458654 0.016339232547159 -0.000000037654965 -0.043200042729041 -0.000000051748545 0.013669010209370 -0.000000037399261 -0.036586854750176 -0.000000027931617 -0.004761447097231 -0.000000027168798 0.017311225027280 -0.000000039099380 0.005650124339593 -0.000000037005759 0.016225097484138 -0.000000039060168 -0.009176443862037 -0.000000012831350 0.055653840638386 -0.000000020855391 -0.002834810081935 -0.000000006540702 0.002794661393905 -0.000000032878097
This would suggest that gradient descent simply did not converge to global minimum... But that is hardly the case as I run it for thousands of iterations. So where is the bug?