How to convert sparse matrix to dense form using python

python numpy matrix scipy word-frequency

12,051

Solution 1

List comprehension is the easiest way:

new_list = [[b for _,b in sub] for sub in mx]

Result:

>>> new_list
[[2, 1, 1, 1, 1, 3, 4, 2, 5, 1], [1, 5, 2, 1, 1, 1, 1, 1, 1, 2], [2, 1, 1, 1, 2, 1, 1, 1, 1, 1]]

Solution 2

Here's a pretty hacky way to do what you're asking for:

dense = [[int(''.join(str(val) for _, val in doc))] for doc in mx]

Basically it converts each value from the nested tuples into a string and concatenates all of those strings together, then converts that back to an integer. Repeat for each element of mx.
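
To make those steps concrete, here is a small sketch that traces the expression for a single document. The names `doc`, `digits`, `joined` and `number` are introduced only for illustration, and the sample data is the first row of `mx` from the question.

```python
# First document from the question's mx: a list of (index, value) pairs.
doc = [(0, 2), (1, 1), (2, 1), (3, 1), (4, 1),
       (5, 3), (6, 4), (7, 2), (8, 5), (9, 1)]

# Step 1: turn each value into a string, ignoring the index.
digits = [str(val) for _, val in doc]

# Step 2: concatenate the strings into one run of digits.
joined = ''.join(digits)

# Step 3: convert the concatenated string back into a single integer.
number = int(joined)

print(digits)
print(joined)
print(number)
```

Because the values are fused into one number, a value of 10 or more can no longer be told apart from two separate values afterwards, so this only makes sense when a single concatenated number is really what is wanted.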

Solution 3

Your source data do not really match any of the built-in formats supported by sparse matrices in SciPy (see http://docs.scipy.org/doc/scipy/reference/sparse.html and http://en.wikipedia.org/wiki/Sparse_matrix), so using .todense() will not really be productive here. In particular, if you have something like:

import numpy as np

my_sparseish_matrix = np.array([[(1, 2), (3, 4)]])

then my_sparseish_matrix will already be a dense NumPy array! Calling .todense() on it at that point will produce an error, and doesn't make sense anyway.
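
As a rough illustration of the difference, the sketch below checks that the NumPy array has no `.todense()` method and then builds a genuine SciPy sparse matrix for comparison. Reading the pairs `(1, 2)` and `(3, 4)` as (column index, value) in a single row of width 4 is an assumption made here for the example, and the names `rows`, `cols`, `vals` and `real_sparse` are hypothetical.

```python
import numpy as np
from scipy.sparse import coo_matrix

my_sparseish_matrix = np.array([[(1, 2), (3, 4)]])
print(type(my_sparseish_matrix))   # an ordinary numpy.ndarray
print(my_sparseish_matrix.shape)   # (1, 2, 2)

# A plain ndarray has no todense() method, hence the AttributeError.
has_todense = hasattr(my_sparseish_matrix, 'todense')
print(has_todense)

# For comparison: a real sparse matrix in COO format, built from
# separate row, column and value arrays (assumed interpretation).
rows = np.array([0, 0])
cols = np.array([1, 3])
vals = np.array([2, 4])
real_sparse = coo_matrix((vals, (rows, cols)), shape=(1, 4))

# On an actual SciPy sparse matrix, todense() is available.
dense = real_sparse.todense()
print(dense)
```

`.toarray()` is the sibling method that returns a regular `ndarray` instead of a `numpy.matrix`.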

So my recommendation is to construct your dense array explicitly using a couple of for loops. To do this you'll need to know how many items are possible in your resulting vector -- call it N.

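# Start from an all-zero vector with one slot per possible index.
dense_vector = np.zeros((N, ), int)
for inner in mx:
    # Each inner list holds (index, value) pairs.
    for index, value in inner:
        dense_vector[index] = value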
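
If N is not known in advance, one option is to derive it from the data. The sketch below assumes that the indices are non-negative integers and that the largest index actually present defines the vector length; if the vocabulary is larger than what occurs in the data, N should come from the vocabulary instead. The `dense_rows` variation, which keeps one row per inner list, is an addition for illustration and not part of the answer above.

```python
import numpy as np

# Data from the question: three lists of (index, value) pairs.
mx = [[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 3), (6, 4), (7, 2), (8, 5), (9, 1)],
      [(10, 1), (11, 5), (12, 2), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2)],
      [(27, 2), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]]

# Assumption: N is one more than the largest index seen anywhere.
N = max(index for inner in mx for index, _ in inner) + 1

# Single dense vector, as in the loop above.
dense_vector = np.zeros((N,), int)
for inner in mx:
    for index, value in inner:
        dense_vector[index] = value

# Variation: one dense row per inner list, so rows stay separate.
dense_rows = np.zeros((len(mx), N), int)
for row, inner in enumerate(mx):
    for index, value in inner:
        dense_rows[row, index] = value

print(N)
print(dense_vector)
print(dense_rows.shape)
```

Indices that never occur in the data (14 through 20 in this sample) simply stay at zero.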

Author by

Tiger1

Updated on June 04, 2022

Comments

Tiger1 almost 2 years
I have the following matrix, which I believe is sparse. I tried converting it to dense form using x.dense, but it never worked. Any suggestions as to how to do this? Thanks.
```
mx=[[(0, 2), (1, 1), (2, 1), (3, 1), (4, 1), (5, 3), (6, 4), (7, 2), (8, 5), (9, 1)], 
[(10, 1), (11, 5), (12, 2), (13, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 2)], 
[(27, 2), (28, 1), (29, 1), (30, 1), (31, 2), (32, 1), (33, 1), (34, 1), (35, 1), (36, 1)]]
```
Someone put forward the solution below, but is there a better way?
```
def assign_coo_to_dense(sparse, dense):
    dense[sparse.row, sparse.col] = sparse.data
```
mx.todense(). Intended output should appear in this form: [[2,1,1,1,1,3,4], [1,5,2,1,1,1,1], [2,1,1,1,2,1,1,1]]
- Floris over 10 years
  
  Are you using numpy or scipy?
- Tiger1 over 10 years
  
  Hi Floris, I'm using numpy, but it seems most people have addressed similar problems using scipy.
- Saullo G. P. Castro over 10 years
  
  @Tiger1 is mx a matrix containing indices or values? In SciPy you will need a maximum dimension of 2 for the sparse matrix, which does not seem to be your case...
- Tiger1 over 10 years
  
  Hi Saullo, indices followed by values.
- Akavall over 10 years
  
  You need to use x.todense(), not x.dense().
- Tiger1 over 10 years
  
  Hi Akavall, I actually made use of x.todense() and got the following error message: AttributeError: 'list' object has no attribute 'todense'
- Floris over 10 years
  
  did you declare mx to be a numpy array?
- Tiger1 over 10 years
  
  @Floris, I actually forgot to declare it as a numpy array. I will try it now. Thanks.
- lmjohns3 over 10 years
  
  It sounds like the data structure you listed is of the form [[(index, value), ...], ...] -- that is, a list of lists, each containing a series of index, value pairs. But since there is only one index associated with each value, this makes me think your data is really a vector. Does the ordering of the lists indicate anything, perhaps the row structure of the matrix? Or can we ignore the list-of-lists part of the structure?
- Tiger1 over 10 years
  
  @LeifJohnson, smart observation. The data is a vector, to be more specific, it represents word frequencies, and in general mx is a list of lists.
- Tiger1 over 10 years
  
  @Floris, I got the same error message after declaring mx as numpy: AttributeError: 'list' object has no attribute 'todense'. My goal is for the output to appear in this dense form: [[2111134], [1521111], [21112111]]
- Akavall over 10 years
  
  Are you sure the output you want is [[2111134], [1521111], [21112111]], not [[2,1,1,1,1,3,4], [1,5,2,1,1,1,1], [2,1,1,1,2,1,1,1]]? The latter seems much more useful.
- Tiger1 over 10 years
  
  Thanks Akavall, I forgot to put the commas in, and that explains why my code isn't working.
Tiger1 over 10 years

Thanks @Imjohns3. How can I know the value of N when the actual data set contains thousands of documents (up to a million items)? Here is code that does that, and also maintains the order of items in the list:

    q = []
    for doc in corpus_tfidf:
        j = [i[1] for i in doc]
        q.append(j)

lmjohns3 over 10 years

Oh wow, that's totally different than what I thought you were asking! It would be helpful to specify this in your question.
Tiger1 over 10 years

Hi Imjohns3, thanks for the solution. It worked, but each item is supposed to be a list, with values separated by commas. See the question for the update. Thanks.
Floris over 10 years

Finally an answer that ignores the whole "what kind of data is this" red herring and gets to the "here is how you get from the input you have to the output you want".
Tiger1 over 10 years

@AKavall, thanks for the solution. It's exactly what I was looking for.