Convert Pandas dataframe to Sparse Numpy Matrix directly
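df.values is a NumPy array, and accessing the values that way is generally faster than calling np.array(df).
scipy.sparse.csr_matrix(df.values)
You might need to take the transpose first, like df.values.T, if you want the DataFrame's columns to become the rows of the matrix. In a DataFrame, axis 0 is the index (rows) and axis 1 is the columns, so df.values has shape (rows, columns).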
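As a minimal sketch of the conversion described above (the DataFrame contents are made up for illustration, and df.to_numpy() is used as the current spelling of df.values), note that the dense array is still materialised before it is compressed:

```python
import pandas as pd
from scipy import sparse

# Hypothetical example data: a small, mostly-zero numeric DataFrame.
df = pd.DataFrame({"a": [0, 1, 0], "b": [0, 0, 2], "c": [3, 0, 0]})

# Both df.values and df.to_numpy() return a dense NumPy array.
dense = df.to_numpy()

# The dense array is then compressed into CSR format.
sparse_matrix = sparse.csr_matrix(dense)

print(sparse_matrix.shape)  # (rows, columns) of the DataFrame
print(sparse_matrix.nnz)    # number of explicitly stored entries
```

Because the intermediate array is dense, this route is probably only practical when the full DataFrame already fits in memory.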
Author: user7289
Updated on January 06, 2022
Comments
-
user7289 over 2 years
I am creating a matrix from a Pandas dataframe as follows:
dense_matrix = df.to_numpy().astype(bool).astype(int)
And then into a sparse matrix with:
sparse_matrix = scipy.sparse.csr_matrix(dense_matrix)
Is there any way to go from a df straight to a sparse matrix?
Thanks in advance.
-
Jake0x32 almost 9 years
But this is supposed to generate a memory copy, isn't it? df.values essentially returns a dense matrix, which is then cast to a csr_matrix. It doesn't work for large matrices.
-
Dan Allan almost 9 years
No, if I understand correctly, df.values does not make a copy.
-
ntg about 8 years
Another way would be, e.g., df.replace(0, np.nan).to_sparse(), which results in a sparse DataFrame though, not a scipy.sparse.csr_matrix ...
-
Stan over 7 years
df.values creates a dense matrix even when df is a SparseDataFrame. Impractical for large datasets.
-
TheRajVJain over 6 years
If you want NaNs to be treated as sparse entries, then you have to do fillna first and then convert.
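A small sketch of the fillna point, using made-up data: NaN is not equal to zero, so scipy stores it as an explicit entry unless it is filled first.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical example data containing one missing value.
df = pd.DataFrame({"a": [np.nan, 1.0], "b": [0.0, 0.0]})

# Without fillna, the NaN cell is kept as a stored entry.
with_nan = sparse.csr_matrix(df.to_numpy())

# Filling NaN with 0 first lets the sparse format drop that cell.
filled = sparse.csr_matrix(df.fillna(0).to_numpy())

print(with_nan.nnz, filled.nnz)
```

Filling with 0 is only an assumption about what a missing value should mean here; a different fill value may suit other data.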
-
SarahData over 5 years
@Stan any solution in the case of a very large dataset?
-
MehmedB over 4 years
And btw, OP asked for a 'direct' solution. You are converting the dataframe to a numpy array and then to a csr_matrix. You are literally densifying the dataframe, creating 'object's by converting the NaNs inside it. Am I missing something here? Why is this the accepted answer? I don't understand.
-
MehmedB over 4 years
Now I guess I understand. Since df.values doesn't return a copy, this is actually a direct conversion?
-
Jiang Xiang over 2 years
This does not work because df.values is returning a regular numpy matrix.
-
nomad culture over 2 years
I got this error:
AttributeError: Can only use the '.sparse' accessor with Sparse data.
I think pandas does not allow running it directly.
-
G. Cohen over 2 years
df has to be a sparse data frame. Convert a dense data frame to a sparse one via:
sparse_df = df.astype(pd.SparseDtype("float64", 0))
-
G. Cohen over 2 years
See solution below please.
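The sparse-dtype route mentioned in the comments can be sketched roughly as follows. This is an illustration with made-up data, assuming pandas 1.0 or later with scipy installed, and is not necessarily the solution the commenter refers to.

```python
import pandas as pd

# Hypothetical example data: a small, mostly-zero float DataFrame.
df = pd.DataFrame({"a": [0.0, 1.0, 0.0], "b": [0.0, 0.0, 2.0]})

# Convert every column to a sparse dtype whose fill value is 0.
sparse_df = df.astype(pd.SparseDtype("float64", 0))

# The .sparse accessor only works once all columns are sparse.
coo = sparse_df.sparse.to_coo()

# Convert from COO to CSR, which is better suited to row slicing and arithmetic.
csr = coo.tocsr()

print(csr.shape, csr.nnz)
```

Note that astype here still starts from a dense DataFrame, so any memory benefit applies mainly when the data is built or loaded as sparse columns in the first place.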