Efficiently Calculating a Euclidean Distance Matrix Using Numpy

python numpy matrix performance euclidean-distance

52,187

Solution 1

You can take advantage of the complex type :

# build a complex array of your cells
z = np.array([complex(c.m_x, c.m_y) for c in cells])

First solution

# mesh this array so that you will have all combinations
m, n = np.meshgrid(z, z)
# get the distance via the norm
out = abs(m-n)

Second solution

Meshing is the main idea. But numpy is clever, so you don't have to generate m & n. Just compute the difference using a transposed version of z. The mesh is done automatically :

out = abs(z[..., np.newaxis] - z)

Third solution

And if z is directly set as a 2-dimensional array, you can use z.T instead of the weird z[..., np.newaxis]. So finally, your code will look like this :

z = np.array([[complex(c.m_x, c.m_y) for c in cells]]) # notice the [[ ... ]]
out = abs(z.T-z)

Example

>>> z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])
>>> abs(z.T-z)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 2.23606798,  0.        ,  4.24264069],
       [ 4.12310563,  4.24264069,  0.        ]])

As a complement, you may want to remove duplicates afterwards, taking the upper triangle :

>>> np.triu(out)
array([[ 0.        ,  2.23606798,  4.12310563],
       [ 0.        ,  0.        ,  4.24264069],
       [ 0.        ,  0.        ,  0.        ]])

Some benchmarks

>>> timeit.timeit('abs(z.T-z)', setup='import numpy as np;z = np.array([[0.+0.j, 2.+1.j, -1.+4.j]])')
4.645645342274779
>>> timeit.timeit('abs(z[..., np.newaxis] - z)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
5.049334864854522
>>> timeit.timeit('m, n = np.meshgrid(z, z); abs(m-n)', setup='import numpy as np;z = np.array([0.+0.j, 2.+1.j, -1.+4.j])')
22.489568296184686

Solution 2

If you don't need the full distance matrix, you will be better off using kd-tree. Consider scipy.spatial.cKDTree or sklearn.neighbors.KDTree. This is because a kd-tree kan find k-nearnest neighbors in O(n log n) time, and therefore you avoid the O(n**2) complexity of computing all n by n distances.

Solution 3

Jake Vanderplas gives this example using broadcasting in Python Data Science Handbook, which is very similar to what @shx2 proposed.

import numpy as np
rand = random.RandomState(42)
X = rand.rand(3, 2)  
dist_sq = np.sum((X[:, np.newaxis, :] - X[np.newaxis, :, :]) ** 2, axis = -1)

dist_sq
array([[0.        , 0.18543317, 0.81602495],
       [0.18543317, 0.        , 0.22819282],
       [0.81602495, 0.22819282, 0.        ]])

Solution 4

Here is how you can do it using numpy:

import numpy as np

x = np.array([0,1,2])
y = np.array([2,4,6])

# take advantage of broadcasting, to make a 2dim array of diffs
dx = x[..., np.newaxis] - x[np.newaxis, ...]
dy = y[..., np.newaxis] - y[np.newaxis, ...]
dx
=> array([[ 0, -1, -2],
          [ 1,  0, -1],
          [ 2,  1,  0]])

# stack in one array, to speed up calculations
d = np.array([dx,dy])
d.shape
=> (2, 3, 3)

Now all is left is computing the L2-norm along the 0-axis (as discussed here):

(d**2).sum(axis=0)**0.5
=> array([[ 0.        ,  2.23606798,  4.47213595],
          [ 2.23606798,  0.        ,  2.23606798],
          [ 4.47213595,  2.23606798,  0.        ]])

View more solutions

52,187

Wes Modes

Wes Modes is a Santa Cruz artist and art researcher working in the Digital Art and New Media program at the University of California Santa Cruz. In various lives, he is a sculptor, writer, performer, community organizer, programmer, geek, and mischief-maker. Current projects: Secret History is a journey to discover and collect the lost narratives of people who live and work on the river from the deck of a recreated shantyboat and present these stories through web-based digital archives and a touring art installation. http://peoplesriverhistory.us Co-related Space is a digitally-enhanced environment that highlights participants’ active engagement and experimentation with sound and light, including complex direct and indirect behavior and relationships. Using motion tracking, laser light projection and a generative soundscape, it encourages interactions between participants, visually and sonically transforming a regularly trafficked space. http://corelatedspace.com

Updated on July 09, 2022

Comments

Wes Modes almost 2 years
I have a set of points in 2-dimensional space and need to calculate the distance from each point to each other point.

I have a relatively small number of points, maybe at most 100. But since I need to do it often and rapidly in order to determine the relationships between these moving points, and since I'm aware that iterating through the points could be as bad as O(n^2) complexity, I'm looking for ways to take advantage of numpy's matrix magic (or scipy).

As it stands in my code, the coordinates of each object are stored in its class. However, I could also update them in a numpy array when I update the class coordinate.
```
class Cell(object):
    """Represents one object in the field."""
    def __init__(self,id,x=0,y=0):
        self.m_id = id
        self.m_x = x
        self.m_y = y
```
It occurs to me to create a Euclidean distance matrix to prevent duplication, but perhaps you have a cleverer data structure.

I'm open to pointers to nifty algorithms as well.

Also, I note that there are similar questions dealing with Euclidean distance and numpy but didn't find any that directly address this question of efficiently populating a full distance matrix.
- Carsten about 10 years
  
  Here, this might help: scipy.spatial.distance.pdist
- Jaime about 10 years
  
  Complexity is going to be O(n^2) no matter what: the best you can do for a general set of points is to only compute n * (n - 1) / 2 distances, which is still O(n^2).
- Eric Gopak almost 5 years
  
  If scipy can be used, consider scipy.spatial.distance_matrix
Wes Modes about 10 years

Did you ever find the distance? If so, you lost me. Where did that happen?
Tweakimp over 5 years

scipy.spatial.distance.cdist is faster than this, 9 times in my test
Rich Pauloo over 5 years

@Tweakimp - you should write an answer with a call to %timeit, perhaps for a small (10x10) and large (1,000,000 x 1,000,000) distance matrix. That would be really useful information for people!
Tweakimp over 5 years

i can not use %timeit in my jupyter notebook because i used the online variant and it runs out of memory for arrays that big
Ramin Melikov about 4 years

This is a super fast solution.
japreiss almost 4 years

This solution is a great example of broadcasting, but it consumes Θ(n^2 * d) memory (where n is the number of vectors and d is the dimension), whereas an optimal solution would only consume O(n^2). (Confirmed by /usr/bin/time -v.)
Ramin Melikov over 3 years

How would you compute the Manhattan distance?
Lars Gebraad almost 3 years

This actually takes quite some memory if you have large x or y, while also being slow. SciPy's distance matrix should be quite somewhat faster.