sparse 3d matrix/array in Python?

python numpy scipy sparse-matrix

30,579

Solution 1

Happy to suggest a (possibly obvious) implementation of this, which could be made in pure Python or C/Cython if you've got time and space for new dependencies, and need it to be faster.

A sparse matrix in N dimensions can assume most elements are empty, so we use a dictionary keyed on tuples:

class NDSparseMatrix:
  def __init__(self):
    self.elements = {}

  def addValue(self, tuple, value):
    self.elements[tuple] = value

  def readValue(self, tuple):
    try:
      value = self.elements[tuple]
    except KeyError:
      # could also be 0.0 if using floats...
      value = 0
    return value

and you would use it like so:

sparse = NDSparseMatrix()
sparse.addValue((1,2,3), 15.7)
should_be_zero = sparse.readValue((1,5,13))

You could make this implementation more robust by verifying that the input is in fact a tuple, and that it contains only integers, but that will just slow things down so I wouldn't worry unless you're releasing your code to the world later.

EDIT - a Cython implementation of the matrix multiplication problem, assuming other tensor is an N Dimensional NumPy array (numpy.ndarray) might look like this:

#cython: boundscheck=False
#cython: wraparound=False

cimport numpy as np

def sparse_mult(object sparse, np.ndarray[double, ndim=3] u):
  cdef unsigned int i, j, k

  out = np.ndarray(shape=(u.shape[0],u.shape[1],u.shape[2]), dtype=double)

  for i in xrange(1,u.shape[0]-1):
    for j in xrange(1, u.shape[1]-1):
      for k in xrange(1, u.shape[2]-1):
        # note, here you must define your own rank-3 multiplication rule, which
        # is, in general, nontrivial, especially if LxMxN tensor...

        # loop over a dummy variable (or two) and perform some summation:
        out[i,j,k] = u[i,j,k] * sparse((i,j,k))

  return out

Although you will always need to hand roll this for the problem at hand, because (as mentioned in code comment) you'll need to define which indices you're summing over, and be careful about the array lengths or things won't work!

EDIT 2 - if the other matrix is also sparse, then you don't need to do the three way looping:

def sparse_mult(sparse, other_sparse):

  out = NDSparseMatrix()

  for key, value in sparse.elements.items():
    i, j, k = key
    # note, here you must define your own rank-3 multiplication rule, which
    # is, in general, nontrivial, especially if LxMxN tensor...

    # loop over a dummy variable (or two) and perform some summation 
    # (example indices shown):
    out.addValue(key) = out.readValue(key) + 
      other_sparse.readValue((i,j,k+1)) * sparse((i-3,j,k))

  return out

My suggestion for a C implementation would be to use a simple struct to hold the indices and the values:

typedef struct {
  int index[3];
  float value;
} entry_t;

you'll then need some functions to allocate and maintain a dynamic array of such structs, and search them as fast as you need; but you should test the Python implementation in place for performance before worrying about that stuff.

Solution 2

An alternative answer as of 2017 is the sparse package. According to the package itself it implements sparse multidimensional arrays on top of NumPy and scipy.sparse by generalizing the scipy.sparse.coo_matrix layout.

Here's an example taken from the docs:

import numpy as np
n = 1000
ndims = 4
nnz = 1000000
coords = np.random.randint(0, n - 1, size=(ndims, nnz))
data = np.random.random(nnz)

import sparse
x = sparse.COO(coords, data, shape=((n,) * ndims))
x
# <COO: shape=(1000, 1000, 1000, 1000), dtype=float64, nnz=1000000>

x.nbytes
# 16000000

y = sparse.tensordot(x, x, axes=((3, 0), (1, 2)))

y
# <COO: shape=(1000, 1000, 1000, 1000), dtype=float64, nnz=1001588>

Solution 3

Have a look at sparray - sparse n-dimensional arrays in Python (by Jan Erik Solem). Also available on github.

Solution 4

Nicer than writing everything new from scratch may be to use scipy's sparse module as far as possible. This may lead to (much) better performance. I had a somewhat similar problem, but I only had to access the data efficiently, not perform any operations on them. Furthermore, my data were only sparse in two out of three dimensions.

I have written a class that solves my problem and could (as far as I think) easily be extended to satisfiy the OP's needs. It may still hold some potential for improvement, though.

import scipy.sparse as sp
import numpy as np

class Sparse3D():
    """
    Class to store and access 3 dimensional sparse matrices efficiently
    """
    def __init__(self, *sparseMatrices):
        """
        Constructor
        Takes a stack of sparse 2D matrices with the same dimensions
        """
        self.data = sp.vstack(sparseMatrices, "dok")
        self.shape = (len(sparseMatrices), *sparseMatrices[0].shape)
        self._dim1_jump = np.arange(0, self.shape[1]*self.shape[0], self.shape[1])
        self._dim1 = np.arange(self.shape[0])
        self._dim2 = np.arange(self.shape[1])

    def __getitem__(self, pos):
        if not type(pos) == tuple:
            if not hasattr(pos, "__iter__") and not type(pos) == slice: 
                return self.data[self._dim1_jump[pos] + self._dim2]
            else:
                return Sparse3D(*(self[self._dim1[i]] for i in self._dim1[pos]))
        elif len(pos) > 3:
            raise IndexError("too many indices for array")
        else:
            if (not hasattr(pos[0], "__iter__") and not type(pos[0]) == slice or
                not hasattr(pos[1], "__iter__") and not type(pos[1]) == slice):
                if len(pos) == 2:
                    result = self.data[self._dim1_jump[pos[0]] + self._dim2[pos[1]]]
                else:
                    result = self.data[self._dim1_jump[pos[0]] + self._dim2[pos[1]], pos[2]].T
                    if hasattr(pos[2], "__iter__") or type(pos[2]) == slice:
                        result = result.T
                return result
            else:
                if len(pos) == 2:
                    return Sparse3D(*(self[i, self._dim2[pos[1]]] for i in self._dim1[pos[0]]))
                else:
                    if not hasattr(pos[2], "__iter__") and not type(pos[2]) == slice:
                        return sp.vstack([self[self._dim1[pos[0]], i, pos[2]]
                                          for i in self._dim2[pos[1]]]).T
                    else:
                        return Sparse3D(*(self[i, self._dim2[pos[1]], pos[2]] 
                                          for i in self._dim1[pos[0]]))

    def toarray(self):
        return np.array([self[i].toarray() for i in range(self.shape[0])])

View more solutions

30,579

zhongqi

Updated on February 16, 2021

Comments

zhongqi about 3 years

In scipy, we can construct a sparse matrix using scipy.sparse.lil_matrix() etc. But the matrix is in 2d.

I am wondering if there is an existing data structure for sparse 3d matrix / array (tensor) in Python?

p.s. I have lots of sparse data in 3d and need a tensor to store / perform multiplication. Any suggestions to implement such a tensor if there's no existing data structure?
- jayunit100 over 12 years
  
  this post might help stackoverflow.com/questions/4490961/…
- Steve Tjoa over 12 years
  
  ...but not sparse, unfortunately.
- Peter over 12 years
  
  What do you mean by a "matrix in 2D"? If you mean a matrix representing a 2D linear transformation, then you're talking about a 2x2 matrix of Real values (approximated by floating point values) with determinant 1 for a rigid rotation. If you want to represent translation as well then you embed the 2x2 matrix inside a 3x3 matrix, and if you want to allow shearing or expansion/contraction then you can relax the determinant requirement -but even so that's a total of 9 floating point values. Why do you want/need a sparse representation?
- zhongqi over 12 years
  
  @Peter "a matrix in 2D" means a matrix in 2 dimension. A unit in a 2d matrix can be represented as (x,y, r), where x & y are the coordinate and r is the value stored at (x, y). I need a sparse representation because when x & y are very very large, say x<10^5, y < 10^4, AND only very few data are stored in the matrix, say 10^4. numpy provides sparse matrix for the 2d matrix. But very often, we need 3d or even n-d. I guess n-d case is too general. So any solutions to 3d are good enough for me.
- Peter over 12 years
  
  Thanks - I was confused by the P.S. in your question (it sounded to me like you wanted to multiply a bunch of Euclidean tuples by a matrix, linear algebra style). But if you're talking about m x n x o matrices then it sounds like your "sparse" implementation is going to need to provide some sort of iterator interface in order for you to implement (element by element) multiplication.
Joe Kington over 12 years

The problem is the mathematical operations, not the data container... I've never heard of algorithms for efficient sparse N-d tensor products. Have a look scipy.sparse.dok_matrix. It's what you're describing here, just limited to 2D. It's easy enough to extend it to hold N-D data, but how do you operate on the data?? (That having been said, your answer is entirely reasonable...)
tehwalrus over 12 years

ah, I've misunderstood? so this question is asking more about the implementation of a scipy compatible matrix multiplication operation? This should be relatively easy to implement, surely, since all you really need for that is an in-loop query for the value at an index, which I've provided. I'll take a look at the scipy specifications though.
Joe Kington over 12 years

Well, arguably, I've misunderstood as well. Either way, my point was that you're not taking advantage of the sparsity structure of when doing operations. What you've described in your edit treats it like a dense array. (Which certainly works! Your answer solves the problem at hand.) Sparse matrix libraries take advantage of the sparse-ness of the array, and avoid things like looping over every element of the array, regardless of "sparsity". That's the main point of using a sparse matrix. Operations roughly scale with the number of "dense" elements, not the overall dimensions of the matrix.
zhongqi over 12 years

@tehwalrus Thanks for the reply. But I am afraid the multiplication with your suggested data structure may not be very efficient...
tehwalrus over 12 years

@JoeKington You'd have to loop over every element in the non-sparse array (u in this case) anyway, surely? Unless both are sparse, in which case I've misunderstood even more. In that case, you can simply loop over the keys in the dictionary, and extract the indices from the tuple. Anyway, I'm not up to speed on sparse algebra, let alone the computer science behind optimising algorithms on the topic. Sorry, @zhongqi !
Joe Kington over 12 years

"Unless both are sparse" <-- That's the idea! :) Sorry I wasn't clearer. I was referring to operations between two sparse arrays. (Which is usually the case, when you're dealing with a situation where using sparse matrices is a good solution.) Multiplications between two sparse arrays is the most common use case for a sparse array library, so there's a lot of effort put into optimizing that particular situation. Implementing an efficient sparse tensordot (i.e. multiplication for N-D) is nontrivial. Again, though, your solution is fine. (+1) It's impractical for large, sparse arrays, though
tehwalrus over 12 years

@JoeKington edit 2 may help, but I'm not sure. At some point, you'll have to deal with the data that is there, though, which means looping over your dummy indices...
TomCho over 6 years

I'm in the same situation and this is very useful. I think with a little extra work this could be implemented into scipy's sparse array module. Have you considered it?
Samufi over 6 years

@TomCho Thanks! I have not considered implementing this into scipy's sparse module. I think an implementation in scipy should support all the standard numpy matrix features. That would be doable but require a decent chunk of work. Also, I think adding C implementations for operations on these matrices would be much more efficient and more appropriate for scipy.
TomCho about 6 years

@JayShin Maybe, but to be honest, I think you'd have to test it out.
MrMartin over 5 years

I recommend writing "as of 2017" instead of "as of this year"
0xc0de about 3 years

Any reason for pointing to this fork and not to the original (github.com/pydata/sparse source of the fork) one? The one linked here isn't updated since 2018 while the the one I've linked is recently updated (4th Jan 2021).
TomCho about 3 years

@0xc0de The reason is that I wasn't paying enough attention! I fixed it with the original link. Thanks