Convert Pandas dataframe to Sparse Numpy Matrix directly

df.values returns a NumPy array, and reading the data that way is generally faster than building a new array with np.array(df).

scipy.sparse.csr_matrix(df.values)

You might need to take the transpose first, like df.values.T. Note that df.values has shape (n_rows, n_columns), so rows are axis 0 and columns are axis 1.
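
As a minimal sketch of the conversion described above, assuming a small all-numeric frame (the column names "a" and "b" and the values are made up for illustration), the dense array can be passed straight to the CSR constructor. Note that this still materializes a dense array first, so it is only suitable when the dense data fits in memory.

```python
import pandas as pd
import scipy.sparse

# Hypothetical example frame; the column names and values are illustrative only.
df = pd.DataFrame({"a": [0, 1, 0], "b": [2, 0, 0]})

# df.to_numpy() is the modern spelling of df.values; both give a dense array
# of shape (n_rows, n_columns).
dense = df.to_numpy()

# Build the CSR matrix from the dense array; only nonzero entries are stored.
sparse_matrix = scipy.sparse.csr_matrix(dense)

print(sparse_matrix.shape)  # (3, 2)
print(sparse_matrix.nnz)    # 2 stored nonzero values
```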

Author: user7289

Updated on January 06, 2022

Comments

  • user7289
    user7289 over 2 years

    I am creating a matrix from a Pandas dataframe as follows:

    dense_matrix = df.to_numpy(dtype=bool).astype(int)
    

    And then into a sparse matrix with:

    sparse_matrix = scipy.sparse.csr_matrix(dense_matrix)
    

    Is there any way to go from a df straight to a sparse matrix?

    Thanks in advance.

  • Jake0x32
    Jake0x32 almost 9 years
    But this is supposed to generate a memory copy, isn't it? df.values essentially returns a dense matrix, which is then cast to a csr_matrix. It doesn't work for large matrices.
  • Dan Allan
    Dan Allan almost 9 years
    No, if I understand correctly, df.values does not make a copy.
  • ntg
    ntg about 8 years
    Another way would be to do e.g. df.replace(0, np.nan).to_sparse(), which results in a sparse DataFrame though, not a scipy.sparse.csr_matrix ...
  • Stan
    Stan over 7 years
    df.values creates a dense matrix if df is a SparseDataFrame. Impractical for large datasets.
  • TheRajVJain
    TheRajVJain over 6 years
    If you want to convert NaNs to sparse, then you have to call fillna first and then convert.
  • SarahData
    SarahData over 5 years
    @Stan, is there any solution for a very large dataset?
  • MehmedB
    MehmedB over 4 years
    And by the way, the OP asked for a 'direct' solution. You are converting the dataframe to a NumPy array and then to a csr_matrix. You are literally densifying the dataframe, creating objects by converting the NaNs inside the dataframe. Am I missing something here? Why is this the accepted answer? I don't understand.
  • MehmedB
    MehmedB over 4 years
    Now I guess I understand. Since df.values doesn't return a copy, this is actually a direct conversion?
  • Jiang Xiang
    Jiang Xiang over 2 years
    This does not work because df.values returns a regular (dense) NumPy array.
  • nomad culture
    nomad culture over 2 years
    I got this error: AttributeError: Can only use the '.sparse' accessor with Sparse data. I think pandas does not allow running it directly.
  • G. Cohen
    G. Cohen over 2 years
    df has to be a sparse data frame. Convert a dense data frame to a sparse one via: sparse_df = df.astype(pd.SparseDtype("float64", 0))
  • G. Cohen
    G. Cohen over 2 years
    See solution below please.
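
The sparse-accessor route mentioned in the last few comments can be sketched as follows. This is only an illustration under stated assumptions: pandas 0.25 or later with SciPy installed, an all-float frame, and a fill value of 0. The frame contents and column names are hypothetical.

```python
import pandas as pd

# Hypothetical example frame; the column names and values are illustrative only.
df = pd.DataFrame({"a": [0.0, 1.0, 0.0], "b": [2.0, 0.0, 0.0]})

# Convert every column to a sparse dtype whose fill value is 0, as suggested
# in the comment above. Without this step the .sparse accessor raises
# AttributeError on dense data.
sparse_df = df.astype(pd.SparseDtype("float64", 0))

# DataFrame.sparse.to_coo() returns a scipy.sparse COO matrix;
# tocsr() then converts it to CSR format.
csr = sparse_df.sparse.to_coo().tocsr()

print(csr.shape)  # (3, 2)
print(csr.nnz)    # 2 stored nonzero values
```

If the data already lives in a dense DataFrame, this route still holds the dense frame in memory at the start, so for very large datasets it helps most when the frame is built with sparse columns from the beginning.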