How can the Euclidean distance be calculated with NumPy?
Solution 1
Use numpy.linalg.norm
:
dist = numpy.linalg.norm(a-b)
This works because the Euclidean distance is the l2 norm, and the default value of the ord
parameter in numpy.linalg.norm
is 2.
For more theory, see Introduction to Data Mining:
Solution 2
There's a function for that in SciPy. It's called Euclidean.
Example:
from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)
Solution 3
For anyone interested in computing multiple distances at once, I've done a little comparison using perfplot (a small project of mine).
The first advice is to organize your data such that the arrays have dimension (3, n)
(and are C-contiguous obviously). If adding happens in the contiguous first dimension, things are faster, and it doesn't matter too much if you use sqrt-sum
with axis=0
, linalg.norm
with axis=0
, or
a_min_b = a - b
numpy.sqrt(numpy.einsum('ij,ij->j', a_min_b, a_min_b))
which is, by a slight margin, the fastest variant. (That actually holds true for just one row as well.)
The variants where you sum up over the second axis, axis=1
, are all substantially slower.
Code to reproduce the plot:
import numpy
import perfplot
from scipy.spatial import distance
def linalg_norm(data):
a, b = data[0]
return numpy.linalg.norm(a - b, axis=1)
def linalg_norm_T(data):
a, b = data[1]
return numpy.linalg.norm(a - b, axis=0)
def sqrt_sum(data):
a, b = data[0]
return numpy.sqrt(numpy.sum((a - b) ** 2, axis=1))
def sqrt_sum_T(data):
a, b = data[1]
return numpy.sqrt(numpy.sum((a - b) ** 2, axis=0))
def scipy_distance(data):
a, b = data[0]
return list(map(distance.euclidean, a, b))
def sqrt_einsum(data):
a, b = data[0]
a_min_b = a - b
return numpy.sqrt(numpy.einsum("ij,ij->i", a_min_b, a_min_b))
def sqrt_einsum_T(data):
a, b = data[1]
a_min_b = a - b
return numpy.sqrt(numpy.einsum("ij,ij->j", a_min_b, a_min_b))
def setup(n):
a = numpy.random.rand(n, 3)
b = numpy.random.rand(n, 3)
out0 = numpy.array([a, b])
out1 = numpy.array([a.T, b.T])
return out0, out1
b = perfplot.bench(
setup=setup,
n_range=[2 ** k for k in range(22)],
kernels=[
linalg_norm,
linalg_norm_T,
scipy_distance,
sqrt_sum,
sqrt_sum_T,
sqrt_einsum,
sqrt_einsum_T,
],
xlabel="len(x), len(y)",
)
b.save("norm.png")
Solution 4
I want to expound on the simple answer with various performance notes. np.linalg.norm will do perhaps more than you need:
dist = numpy.linalg.norm(a-b)
Firstly - this function is designed to work over a list and return all of the values, e.g. to compare the distance from pA
to the set of points sP
:
sP = set(points)
pA = point
distances = np.linalg.norm(sP - pA, ord=2, axis=1.) # 'distances' is a list
Remember several things:
- Python function calls are expensive.
- [Regular] Python doesn't cache name lookups.
So
def distance(pointA, pointB):
dist = np.linalg.norm(pointA - pointB)
return dist
isn't as innocent as it looks.
>>> dis.dis(distance)
2 0 LOAD_GLOBAL 0 (np)
2 LOAD_ATTR 1 (linalg)
4 LOAD_ATTR 2 (norm)
6 LOAD_FAST 0 (pointA)
8 LOAD_FAST 1 (pointB)
10 BINARY_SUBTRACT
12 CALL_FUNCTION 1
14 STORE_FAST 2 (dist)
3 16 LOAD_FAST 2 (dist)
18 RETURN_VALUE
Firstly - every time we call it, we have to do a global lookup for "np", a scoped lookup for "linalg" and a scoped lookup for "norm", and the overhead of merely calling the function can equate to dozens of python instructions.
Lastly, we wasted two operations on to store the result and reload it for return...
First pass at improvement: make the lookup faster, skip the store
def distance(pointA, pointB, _norm=np.linalg.norm):
return _norm(pointA - pointB)
We get the far more streamlined:
>>> dis.dis(distance)
2 0 LOAD_FAST 2 (_norm)
2 LOAD_FAST 0 (pointA)
4 LOAD_FAST 1 (pointB)
6 BINARY_SUBTRACT
8 CALL_FUNCTION 1
10 RETURN_VALUE
The function call overhead still amounts to some work, though. And you'll want to do benchmarks to determine whether you might be better doing the math yourself:
def distance(pointA, pointB):
return (
((pointA.x - pointB.x) ** 2) +
((pointA.y - pointB.y) ** 2) +
((pointA.z - pointB.z) ** 2)
) ** 0.5 # fast sqrt
On some platforms, **0.5
is faster than math.sqrt
. Your mileage may vary.
**** Advanced performance notes.
Why are you calculating distance? If the sole purpose is to display it,
print("The target is %.2fm away" % (distance(a, b)))
move along. But if you're comparing distances, doing range checks, etc., I'd like to add some useful performance observations.
Let’s take two cases: sorting by distance or culling a list to items that meet a range constraint.
# Ultra naive implementations. Hold onto your hat.
def sort_things_by_distance(origin, things):
return things.sort(key=lambda thing: distance(origin, thing))
def in_range(origin, range, things):
things_in_range = []
for thing in things:
if distance(origin, thing) <= range:
things_in_range.append(thing)
The first thing we need to remember is that we are using Pythagoras to calculate the distance (dist = sqrt(x^2 + y^2 + z^2)
) so we're making a lot of sqrt
calls. Math 101:
dist = root ( x^2 + y^2 + z^2 )
:.
dist^2 = x^2 + y^2 + z^2
and
sq(N) < sq(M) iff M > N
and
sq(N) > sq(M) iff N > M
and
sq(N) = sq(M) iff N == M
In short: until we actually require the distance in a unit of X rather than X^2, we can eliminate the hardest part of the calculations.
# Still naive, but much faster.
def distance_sq(left, right):
""" Returns the square of the distance between left and right. """
return (
((left.x - right.x) ** 2) +
((left.y - right.y) ** 2) +
((left.z - right.z) ** 2)
)
def sort_things_by_distance(origin, things):
return things.sort(key=lambda thing: distance_sq(origin, thing))
def in_range(origin, range, things):
things_in_range = []
# Remember that sqrt(N)**2 == N, so if we square
# range, we don't need to root the distances.
range_sq = range**2
for thing in things:
if distance_sq(origin, thing) <= range_sq:
things_in_range.append(thing)
Great, both functions no-longer do any expensive square roots. That'll be much faster. We can also improve in_range by converting it to a generator:
def in_range(origin, range, things):
range_sq = range**2
yield from (thing for thing in things
if distance_sq(origin, thing) <= range_sq)
This especially has benefits if you are doing something like:
if any(in_range(origin, max_dist, things)):
...
But if the very next thing you are going to do requires a distance,
for nearby in in_range(origin, walking_distance, hotdog_stands):
print("%s %.2fm" % (nearby.name, distance(origin, nearby)))
consider yielding tuples:
def in_range_with_dist_sq(origin, range, things):
range_sq = range**2
for thing in things:
dist_sq = distance_sq(origin, thing)
if dist_sq <= range_sq: yield (thing, dist_sq)
This can be especially useful if you might chain range checks ('find things that are near X and within Nm of Y', since you don't have to calculate the distance again).
But what about if we're searching a really large list of things
and we anticipate a lot of them not being worth consideration?
There is actually a very simple optimization:
def in_range_all_the_things(origin, range, things):
range_sq = range**2
for thing in things:
dist_sq = (origin.x - thing.x) ** 2
if dist_sq <= range_sq:
dist_sq += (origin.y - thing.y) ** 2
if dist_sq <= range_sq:
dist_sq += (origin.z - thing.z) ** 2
if dist_sq <= range_sq:
yield thing
Whether this is useful will depend on the size of 'things'.
def in_range_all_the_things(origin, range, things):
range_sq = range**2
if len(things) >= 4096:
for thing in things:
dist_sq = (origin.x - thing.x) ** 2
if dist_sq <= range_sq:
dist_sq += (origin.y - thing.y) ** 2
if dist_sq <= range_sq:
dist_sq += (origin.z - thing.z) ** 2
if dist_sq <= range_sq:
yield thing
elif len(things) > 32:
for things in things:
dist_sq = (origin.x - thing.x) ** 2
if dist_sq <= range_sq:
dist_sq += (origin.y - thing.y) ** 2 + (origin.z - thing.z) ** 2
if dist_sq <= range_sq:
yield thing
else:
... just calculate distance and range-check it ...
And again, consider yielding the dist_sq. Our hotdog example then becomes:
# Chaining generators
info = in_range_with_dist_sq(origin, walking_distance, hotdog_stands)
info = (stand, dist_sq**0.5 for stand, dist_sq in info)
for stand, dist in info:
print("%s %.2fm" % (stand, dist))
Solution 5
Another instance of this problem solving method:
def dist(x,y):
return numpy.sqrt(numpy.sum((x-y)**2))
a = numpy.array((xa,ya,za))
b = numpy.array((xb,yb,zb))
dist_a_b = dist(a,b)
Comments
-
Nathan Fellman almost 2 years
I have two points in 3D:
(xa, ya, za) (xb, yb, zb)
And I want to calculate the distance:
dist = sqrt((xa-xb)^2 + (ya-yb)^2 + (za-zb)^2)
What's the best way to do this with NumPy, or with Python in general? I have:
import numpy a = numpy.array((xa ,ya, za)) b = numpy.array((xb, yb, zb))
-
u0b34a0f6ae over 14 yearscan you use numpy's sqrt and/or sum implementations? That should make it faster (?).
-
u0b34a0f6ae over 14 yearsI found this on the other side of the interwebs
norm = lambda x: N.sqrt(N.square(x).sum())
;norm(x-y)
-
Mark Lavin over 14 yearsThe linalg.norm docs can be found here: docs.scipy.org/doc/numpy/reference/generated/… My only real comment was sort of pointing out the connection between a norm (in this case the Frobenius norm/2-norm which is the default for norm function) and a metric (in this case Euclidean distance).
-
Joe Kington over 13 yearsYou're badly misunderstanding how to use numpy... Don't use loops or list comprehensions. If you're iterating through, and applying the function to each item, then, yeah, the numpy functions will be slower. The whole point is to vectorize things.
-
user118662 over 13 yearsIf I move the numpy.array call into the loop where I am creating the points I do get better results with numpy_calc_dist, but it is still 10x slower than fastest_calc_dist. If I have that many points and I need to find the distance between each pair I'm not sure what else I can do to advantage numpy.
-
Scott B almost 13 yearsI realize this thread is old, but I just want to reinforce what Joe said. You are not using numpy correctly. What you are calculating is the sum of the distance from every point in p1 to every point in p2. The solution with numpy/scipy is over 70 times quicker on my machine. Make p1 and p2 into an array (even using a loop if you have them defined as dicts). Then you can get the total sum in one step,
scipy.spatial.distance.cdist(p1, p2).sum()
. That is it. -
Scott B almost 13 yearsOr use
numpy.linalg.norm(p1-p2).sum()
to get the sum between each point in p1 and the corresponding point in p2 (i.e. not every point in p1 to every point in p2). And if you do want every point in p1 to every point in p2 and don't want to use scipy as in my previous comment, then you can use np.apply_along_axis along with numpy.linalg.norm to still do it much, much quicker then your "fastest" solution. -
Nathan Fellman over 12 yearsthis will give me the square of the distance. you're missing a sqrt here.
-
Fred Foo over 10 yearsPrevious versions of NumPy had very slow norm implementations. In current versions, there's no need for all this.
-
Algold almost 9 yearsIf you look for efficiency it is better to use the numpy function. The scipy distance is twice as slow as numpy.linalg.norm(a-b) (and numpy.sqrt(numpy.sum((a-b)**2))). On my machine I get 19.7 µs with scipy (v0.15.1) and 8.9 µs with numpy (v1.9.2). Not a relevant difference in many cases but if in loop may become more significant. From a quick look at the scipy code it seems to be slower because it validates the array before computing the distance.
-
1a1a11a over 7 yearsbesides, if your p is multidimensional like more than 100, numpy is even better.
-
Alejandro Sazo about 7 yearsNumpy also accepts lists as inputs (no need to explicitly pass a numpy array)
-
mnky9800n about 7 yearsIf OP wanted to calculate the distance between an array of coordinates it is also possible to use scipy.spatial.distance.cdist.
-
Domenico Monaco over 6 yearsmy question is: why use this in opposite of this?stackoverflow.com/a/21986532/189411 from scipy.spatial import distance a = (1,2,3) b = (4,5,6) dst = distance.euclidean(a,b)
-
Keith over 6 yearsWhy not add such an optimized function to numpy? An extension for pandas would also be great for a question like this stackoverflow.com/questions/47643952/…
-
Avision over 6 years@MikePalmice yes, scipy functions are fully compatible with numpy. But take a look at what aigold suggested here (which also works on numpy array, of course)
-
3pitt over 6 years@Avision not sure if it will work for me since my matrices have different numbers of rows; trying to subtract them to get one matrix doesn't work
-
Avision over 6 years@MikePalmice what exactly are you trying to compute with these two matrices? what is the expected input/output?
-
3pitt over 6 yearsty for following up. There's a description here: stats.stackexchange.com/questions/322620/… . I have 2 tables of 'operations'; each has a 'code' label, but the two sets of labels are totally different. my goal is to find the best or closest code from the second table corresponding to a fixed code in the first (I know what the answer should be from manual inspection, but want to scale up to hundreds of tables later). So the first subset is fixed; I calculate avg euclid dist bw this and all code subsets of the 2nd, then sort
-
xskxzr over 6 yearsWhat's the difference from this answer?
-
Daniel Braun over 5 yearsOnly on 1-dimensional array tho
-
Bram Vanroy over 5 yearsI edited your first mathematical approach to distance. You were using a
pointZ
that didn't exist. I think what you meant was two points in three dimensional space and I edited accordingly. If I was wrong, please let me know. -
Tirtha R over 5 yearsThank you. I learnt something new today! For single dimension array, the string will be
i,i->
-
dragonLOLz about 5 yearsitd be evern more cool if there was a comparision of memory consumptions
-
Steven C. Howell about 5 yearsupdated link to SciPy's cdist function: docs.scipy.org/doc/scipy/reference/generated/…
-
Sigex about 5 yearsDoing maths directly in python is not a good idea as python is very slow, specifically
for a, b in zip(a, b)
. But useful none the less. -
Nico Schlömer almost 5 yearsThis is no longer applicable. (mpl 3.0)
-
Johannes Wiesner over 4 yearsI would like to use your code but I am struggling with understanding how the data is supposed to be organized. Can you give an example? How does
data
have to look like? -
Mad Physicist about 4 yearsReally neat project and findings. I've been doing some half-a***ed plots of the same nature, so I think I'll switch to your project and contribute the differences, if you like them.
-
ghost_programmer about 4 yearsthere are even more faster methods than numpy.linalg.norm: semantive.com/blog/…
-
Avinash almost 4 yearsSometimes it is giving NaN values in the column
-
Josmy Faure almost 4 yearsYou don't even need to zip a and b.
sqrt(sum( (a - b)**2))
would do the trick. Nice answer by the way -
NoBugs over 3 yearsYou should note it doesn't find the distance, it returns a numpy array containing the distance. How do you get the number that is the distance?
-
Doyousketch2 over 3 yearsPython 3.8+ math.hypot() isn't limited to 2 dimensions.
dist = math.hypot( xa-xb, ya-yb, za-zb )
-
Taylor Alexander almost 3 years@JohannesWiesner the parent says the shape must be (3,n). We can open a python terminal and see what that looks like. >>> np.zeros((3, 1)) array([[0.], [0.], [0.]]) Or for 5 values: >>> np.zeros((3, 5)) array([[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.]])
-
rubengavidia0x about 2 years+1 Nice approach using 1e+200 values, But I think hypo doesn't work now for three arguments, I have TypeError: hypot() takes exactly 2 arguments (3 given)
-
eroot163pi about 2 yearsYes for numpy hypot, it takes only two arguments...that's the reason why in speed comparison I use np.sqrt(np.sum