Why is weight vector orthogonal to decision plane in neural networks


Solution 1

The weights are simply the coefficients that define a separating plane. For the moment, forget about neurons and just consider the geometric definition of a plane in N dimensions:

w1*x1 + w2*x2 + ... + wN*xN - w0 = 0

You can also think of this as being a dot product:

w*x - w0 = 0

where w and x are both length-N vectors. This equation holds for all points on the plane. Recall that we can multiply the equation by a constant and it still holds, so we can choose the constants such that the vector w has unit length. Now, take out a piece of paper and draw your x-y axes (x1 and x2 in the above equations). Next, draw a line (a plane in 2D) somewhere near the origin. w0 is simply the perpendicular distance from the origin to the plane, and w is the unit vector that points from the origin along that perpendicular. If you now draw a vector from the origin to any point on the plane, the dot product of that vector with the unit vector w will always be equal to w0, so the equation above holds, right? This is simply the geometric definition of a plane: a unit vector defining the perpendicular to the plane (w) and the distance (w0) from the origin to the plane.
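
Here is a minimal numerical sketch of that geometric picture (the numbers and the numpy code are mine, not part of the original answer): pick a unit normal w and a distance w0, construct a few points on the plane, and check that w*x - w0 is zero for each of them.

    import numpy as np

    # A line (a plane in 2D): w*x - w0 = 0, with w a unit normal and w0 the
    # perpendicular distance from the origin to the plane.
    w = np.array([3.0, 4.0])
    w = w / np.linalg.norm(w)        # make w a unit vector: (0.6, 0.8)
    w0 = 2.0                         # distance from the origin to the plane

    # Any point on the plane is the foot of the perpendicular (w0 * w)
    # plus some vector parallel to the plane.
    parallel = np.array([-w[1], w[0]])       # direction lying in the plane
    for t in (-3.0, 0.0, 5.0):
        x = w0 * w + t * parallel            # a point on the plane
        print(np.dot(w, x) - w0)             # ~0 for every such point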

Now our neuron is simply representing the same plane as described above but we just describe the variables a little differently. We'll call the components of x our "inputs", the components of w our "weights", and we'll call the distance w0 a bias. That's all there is to it.

Getting a little beyond your actual question, we don't really care about points on the plane. We really want to know which side of the plane a point falls on. While w*x - w0 is exactly zero on the plane, it will have positive values for points on one side of the plane and negative values for points on the other side. That's where the neuron's activation function comes in but that's beyond your actual question.
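
A small follow-up sketch of that sign test (again an illustration with made-up numbers, not part of the answer): the sign of w*x - w0 tells which side of the plane a point falls on, which is exactly what the activation function thresholds.

    import numpy as np

    w = np.array([0.6, 0.8])   # the unit normal from the sketch above
    w0 = 2.0

    def side(x):
        """Signed distance of x from the plane; its sign picks the class."""
        return np.dot(w, x) - w0

    print(side(np.array([1.2, 1.6])))   # 0.0  -> on the plane (this is w0 * w)
    print(side(np.array([3.0, 3.0])))   # +2.2 -> one side ("class 1")
    print(side(np.array([0.0, 0.0])))   # -2.0 -> other side ("class 0")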

Solution 2

Intuitively, in a binary problem the weight vector points in the direction of the '1'-class, while the '0'-class is found when pointing away from the weight vector. The decision boundary should thus be drawn perpendicular to the weight vector.

See the image for a simplified example: you have a neural network with only 1 input, which thus has 1 weight. If the weight is -1 (the blue vector), then all negative inputs will become positive, so the whole negative spectrum will be assigned to the '1'-class, while the positive spectrum will be the '0'-class. The decision boundary in a 2-axis plane is thus a vertical line through the origin (the red line). Simply put, it is the line perpendicular to the weight vector.

Let's go through this example with a few values. The output of the perceptron is class 1 if the sum of all inputs * weights is larger than 0 (the default threshold); otherwise, if the output is smaller than the threshold of 0, the class is 0. Your input has value 1. The weight applied to this single input is -1, so 1 * -1 = -1, which is less than 0. The input is thus assigned class 0 (NOTE: class 0 and class 1 could just as well have been called class A and class B; don't confuse them with the input and weight values). Conversely, if the input is -1, then input * weight is -1 * -1 = 1, which is larger than 0, so the input is assigned to class 1. If you try every input value, you will see that all the negative values in this example have an output larger than 0, so all of them belong to class 1. All positive values will have an output smaller than 0 and will therefore be classified as class 0. Draw the line which separates all positive and negative input values (the red line) and you will see that this line is perpendicular to the weight vector.
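
A tiny sketch of this worked example, assuming the single weight of -1 and the threshold of 0 used above (the function name is mine, not from the answer):

    # Hypothetical one-input perceptron from the example: weight = -1, threshold = 0.
    weight = -1.0

    def classify(x):
        return 1 if x * weight > 0 else 0

    for x in (1.0, -1.0, 2.5, -0.3):
        print(x, "->", classify(x))
    # 1.0 -> 0, -1.0 -> 1, 2.5 -> 0, -0.3 -> 1
    # Every negative input lands in class 1, every positive input in class 0;
    # the boundary x = 0 is perpendicular to the (one-component) weight vector [-1].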

Also note that the weight vector is only used to modify the inputs to fit the wanted output. What would happen without a weight vector? An input of 1 would result in an output of 1, which is larger than the threshold of 0, so the class would be '1'.

[Image: the weight vector (blue) and the decision boundary perpendicular to it (red line)]

The second image on this page shows a perceptron with 2 inputs and a bias. The first input has the same weight as my example, while the second input has a weight of 1. The corresponding weight vector together with the decision boundary are thus changed as seen in the image. Also the decision boundary has been translated to the right due to an added bias of 1.
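
A hedged sketch of that two-input case, assuming weights of -1 and 1 and a bias of 1 as described (the exact sign convention of the bias depends on the figure, which is not reproduced here):

    import numpy as np

    # Two-input case described above: weights -1 and 1, bias 1.
    w = np.array([-1.0, 1.0])
    bias = 1.0

    def classify(x):
        # class 1 when the weighted sum (plus bias) exceeds the 0 threshold
        return 1 if np.dot(w, x) + bias > 0 else 0

    # The decision boundary is {x : w . x + bias = 0}, a line with direction
    # (1, 1); that direction is orthogonal to w, so the translation caused by
    # the bias never changes the perpendicularity.
    print(np.dot(w, np.array([1.0, 1.0])))   # 0.0

    print(classify(np.array([0.0, 2.0])))    # 1  (weighted sum = 2, +1 = 3 > 0)
    print(classify(np.array([2.0, 0.0])))    # 0  (weighted sum = -2, +1 = -1 < 0)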

Solution 3

Here is a viewpoint from a more fundamental linear algebra/calculus standpoint:

The general equation of a plane is Ax + By + Cz = D (can be extended for higher dimensions). The normal vector can be extracted from this equation: [A B C]; it is the vector orthogonal to every other vector that lies on the plane.

Now if we have a weight vector [w1 w2 w3], we ask when w^T * x >= 0 (to get a positive classification) and when w^T * x < 0 (to get a negative classification). WLOG, we can also use w^T * x >= d. Now, do you see where I am going with this?

The weight vector is the same as the normal vector from the first section. And as we know, this normal vector (and a point) define a plane: which is exactly the decision boundary. Hence, because the normal vector is orthogonal to the plane, then so too is the weight vector orthogonal to the decision boundary.
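
A quick numerical illustration of that argument (the weight values and the helper function below are arbitrary choices of mine, not from the answer): any vector connecting two points of the decision boundary {x : w^T * x = d} has zero dot product with w.

    import numpy as np

    # The weight vector w defines the boundary {x : w . x = d}. Any vector
    # joining two points of that set is orthogonal to w, which is what
    # "normal to the plane" means.
    w = np.array([1.0, 2.0, 3.0])   # illustrative values
    d = 4.0

    rng = np.random.default_rng(0)

    def point_on_plane():
        # pick x1, x2 at random and solve w . x = d for x3
        x1, x2 = rng.normal(size=2)
        x3 = (d - w[0] * x1 - w[1] * x2) / w[2]
        return np.array([x1, x2, x3])

    p, q = point_on_plane(), point_on_plane()
    print(np.dot(w, q - p))   # ~0: the in-plane vector q - p is orthogonal to w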

Solution 4

  1. Start with the simplest form, ax + by = 0; the weight vector is [a, b] and the feature vector is [x, y]
  2. Then y = (-a/b)x is the decision boundary, with slope -a/b
  3. The weight vector has slope b/a
  4. If you multiply those two slopes together, the result is -1
  5. This proves the decision boundary is perpendicular to the weight vector (see the numerical check below)
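
A one-line numerical check of this slope argument, with a and b picked arbitrarily:

    # Slope of the boundary times slope of the weight vector is -1,
    # i.e. the two are perpendicular.
    a, b = 2.0, 5.0
    boundary_slope = -a / b     # slope of the line ax + by = 0
    weight_slope = b / a        # slope of the weight vector [a, b]
    print(boundary_slope * weight_slope)   # -1.0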

Solution 5

Although the question was asked 2 years ago, I think many students will have the same doubts. I reached this answer because I asked the same question.

Now, just think of X, Y (a Cartesian coordinate system is a coordinate system that specifies each point uniquely in a plane by a pair of numerical coordinates, which are the signed distances from the point to two fixed perpendicular directed lines [from Wikipedia]).

If Y = 3X, then in geometry the Y-axis is perpendicular to the X-axis. Let w = 3, so Y = wX and w = Y/X; if we want to draw the relation between X and w, we will have two perpendicular lines, just like when we draw the relation between X and Y. So always think of the w-coefficient as perpendicular to X and Y.


Comments

  • 8A52
    8A52 about 2 years

I am a beginner in neural networks. I am learning about perceptrons. My question is: why is the weight vector perpendicular to the decision boundary (hyperplane)? I have referred to many books, and all of them mention that the weight vector is perpendicular to the decision boundary, but none of them say why.

    Can anyone give me an explanation or reference to a book?

  • 8A52
    8A52 about 12 years
Hi, thanks a lot for the answer. I am still not convinced. Why does the weight vector always point towards the '1'-class? And why should it be perpendicular? I will be very happy if you can provide a mathematical proof :-)
  • Sicco
    Sicco about 12 years
    I have extended my answer. Has it become more clear? Or do you still have a specific question?
  • 8A52
    8A52 about 12 years
Hi, I know this math, which is about neuron activation. My doubt is about the first paragraph itself. I want to know why the weight vector is perpendicular to the decision boundary. You said the weight vector points in the direction of class 1: why? And why should the weight vector therefore be perpendicular? Can you show me mathematically that the weight vector is perpendicular to the decision boundary? I am sorry, but I don't get the intuition for why the weight vector points towards class 1 :(
  • Sicco
    Sicco about 12 years
I extended the answer again. I'm not sure if I will be able to explain it better. It is really quite simple, there is no magic going on. Read the answer a few more times and try out some calculations for yourself.
  • jds
    jds over 7 years
    Really great explanation. This makes the algorithm on Wikipedia make perfect sense. Can you explain the difference between that algorithm and this one: cs.princeton.edu/courses/archive/fall16/cos402/lectures/… (slide 25)? Here, we take the dot product of w^T—I assume my professor means w^{\bot} in LaTeX—but then update w. But with your explanation, we take the dot product with respect to w and then update w. Am I missing something?
  • bogatron
    bogatron over 7 years
Thanks. The superscript "T" on slide 25 means "transpose" because w is a vector. So $w^{T}x_{i}$ (in LaTeX) is just the dot product of w and x_i.
  • MrRobot9
    MrRobot9 over 5 years
How is w0 the perpendicular distance from the origin to the plane?
  • information_interchange
    information_interchange over 5 years
"If you now draw a vector from the origin to any point on the plane, the dot product of that vector with the unit vector w will always be equal to w0 so the equation above holds, right?" Why is this the case?
  • bogatron
    bogatron over 5 years
    @information_interchange Because the equation is simply the Hesse normal form of the definition of a plane. For a detailed explanation and derivation of the Hesse normal form, see the link in my previous comment.
  • muhammad800804
    muhammad800804 over 4 years
    "This is simply the geometric definition of a plane: a unit vector defining the perpendicular to the plane (w) and the distance (w0) from the origin to the plane." Little bit confused about the terminology. According to the equation of the plane, orthogonal vector for plane x -y = 0 is (1,-1). Am I correct?Any explanation for this?
  • bogatron
    bogatron over 4 years
    @himu800804, the case you described is a line through the origin with slope equal to 1. In that case w0 = 0 and the unit normal is w = (1/sqrt(2),-1/sqrt(2)) or ~(0.7071,-0.7071). Since the plane passes through the origin, the signed distance of any point to the plane is just the dot product of the point's coordinates with w. For example, the point (1,1) has distance 0 (i.e., it is on the plane), (1,-1) has distance sqrt(2), and (-1,1) has distance -sqrt(2).
  • muhammad800804
    muhammad800804 over 4 years
I am struggling with how you calculate w = (1/sqrt(2), -1/sqrt(2)). Also, why are you calling w a unit normal instead of a normal? Please explain.
  • bogatron
    bogatron over 4 years
@himu800804 The vector (1, -1) does point in the right direction (normal to the plane), but it does not have unit length (magnitude). It has length sqrt(2). So to make it a unit normal (a normal vector with length 1), you must divide it by its magnitude, which results in the value of w I described.