What are the differences between feather and parquet?

  • Parquet format is designed for long-term storage, whereas Arrow is more intended for short-term or ephemeral storage (Arrow may be more suitable for long-term storage after the 1.0.0 release happens, since the binary format will be stable then)

  • Parquet is more expensive to write than Feather as it features more layers of encoding and compression. Feather is unmodified raw columnar Arrow memory. We will probably add simple compression to Feather in the future.

  • Due to dictionary encoding, RLE encoding, and data page compression, Parquet files will often be much smaller than Feather files (a minimal size-comparison sketch follows this list)

  • Parquet is a standard storage format for analytics that's supported by many different systems: Spark, Hive, Impala, various AWS services, in the future also BigQuery, etc. So if you are doing analytics, Parquet is a good option as a reference storage format for querying by multiple systems
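
As a rough illustration of the size difference, here is a minimal sketch that writes the same DataFrame to both formats and compares the resulting files. The column names, row count, and the compression='uncompressed' argument for Feather (which assumes pyarrow >= 0.17) are only for demonstration; the exact ratio depends heavily on your data:

    import os

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Illustrative data: a low-cardinality string column (which dictionary-encodes
    # well in Parquet) plus a float column.
    n = 1_000_000
    df = pd.DataFrame({'category': np.random.choice(['a', 'b', 'c'], size=n),
                       'value': np.random.randn(n)})

    # Feather as raw columnar Arrow memory vs. Parquet with its default
    # dictionary/RLE encoding and snappy page compression.
    feather.write_feather(df, 'example.feather', compression='uncompressed')
    pq.write_table(pa.Table.from_pandas(df), 'example.parquet')

    print('feather:', os.path.getsize('example.feather'), 'bytes')
    print('parquet:', os.path.getsize('example.parquet'), 'bytes')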

The benchmarks you showed are going to be very noisy, since the data you read and wrote is very small. You should try compressing at least 100 MB or upwards of 1 GB of data to get more informative benchmarks; see e.g. http://wesmckinney.com/blog/python-parquet-multithreading/
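
A benchmark at that scale could look roughly like the sketch below (the 10-million-row DataFrame and the repeat count are arbitrary illustrative choices, not authoritative numbers):

    import time

    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq

    # Roughly a few hundred MB in memory, so I/O and (de)compression dominate
    # rather than per-call overhead.
    n = 10_000_000
    df = pd.DataFrame({'ints': np.random.randint(0, 1_000, size=n),
                       'floats': np.random.randn(n),
                       'strings': np.random.choice(['foo', 'bar', 'baz'], size=n)})

    def bench(label, fn, repeat=3):
        # Time fn() several times and report the best run.
        times = []
        for _ in range(repeat):
            t0 = time.perf_counter()
            fn()
            times.append(time.perf_counter() - t0)
        print(f'{label}: best of {repeat} runs = {min(times):.2f} s')

    bench('feather write', lambda: feather.write_feather(df, 'big.feather'))
    bench('parquet write', lambda: pq.write_table(pa.Table.from_pandas(df), 'big.parquet'))
    bench('feather read ', lambda: feather.read_feather('big.feather'))
    bench('parquet read ', lambda: pq.read_table('big.parquet').to_pandas())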

Hope this helps


Comments

  • Darkonaut
    Darkonaut over 4 years

    Both are columnar (disk) storage formats for use in data analysis systems. Both are integrated within Apache Arrow (the pyarrow package for Python) and are designed to correspond with Arrow as a columnar in-memory analytics layer.

    How do both formats differ?

    Should you always prefer feather when working with pandas, whenever possible?

    What are the use cases where feather is more suitable than parquet and the other way round?


    Appendix

    I found some hints here https://github.com/wesm/feather/issues/188, but given the young age of this project, it's possibly a bit out of date.

    This is not a serious speed test, because I'm just dumping and loading a whole DataFrame, but it should give you some impression if you've never heard of the formats before:

     # IPython    
    import numpy as np
    import pandas as pd
    import pyarrow as pa
    import pyarrow.feather as feather
    import pyarrow.parquet as pq
    import fastparquet as fp
    
    
    df = pd.DataFrame({'one': [-1, np.nan, 2.5],
                       'two': ['foo', 'bar', 'baz'],
                       'three': [True, False, True]})
    
    print("pandas df to disk ####################################################")
    print('example_feather:')
    %timeit feather.write_feather(df, 'example_feather')
    # 2.62 ms ± 35.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    print('example_parquet:')
    %timeit pq.write_table(pa.Table.from_pandas(df), 'example.parquet')
    # 3.19 ms ± 51 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    print()
    
    print("for comparison:")
    print('example_pickle:')
    %timeit df.to_pickle('example_pickle')
    # 2.75 ms ± 18.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    print('example_fp_parquet:')
    %timeit fp.write('example_fp_parquet', df)
    # 7.06 ms ± 205 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    print('example_hdf:')
    %timeit df.to_hdf('example_hdf', 'key_to_store', mode='w', format='table')
    # 24.6 ms ± 4.45 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
    print()
    
    print("pandas df from disk ##################################################")
    print('example_feather:')
    %timeit feather.read_feather('example_feather')
    # 969 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    print('example_parquet:')
    %timeit pq.read_table('example.parquet').to_pandas()
    # 1.9 ms ± 5.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    
    print("for comparison:")
    print('example_pickle:')
    %timeit pd.read_pickle('example_pickle')
    # 1.07 ms ± 6.21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
    print('example_fp_parquet:')
    %timeit fp.ParquetFile('example_fp_parquet').to_pandas()
    # 4.53 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
    print('example_hdf:')
    %timeit pd.read_hdf('example_hdf')
    # 10 ms ± 43.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    # pandas version: 0.22.0
    # fastparquet version: 0.1.3
    # numpy version: 1.13.3
    # pyarrow version: 0.8.0
    # sys.version: 3.6.3
    # example Dataframe taken from https://arrow.apache.org/docs/python/parquet.html
    
  • Darkonaut
    Darkonaut over 6 years
    Thank you very much Wes! If you add simple compression to Feather in the future, would it be possible to make it optional? I'm considering feather as an alternative format to json for event storage in a stream-processing system with fairly low latency requirements. So for this use case my concern would mainly be speed, not storage size.
  • Wes McKinney
    Wes McKinney over 6 years
    Yes, "uncompressed" will always be an option
  • Darkonaut
    Darkonaut over 6 years
    I noticed that your generate_floats function in your benchmark code here wesmckinney.com/blog/python-parquet-update doesn't guarantee unique_values. They are just random. With n=100M I got duplicates two out of ten runs. Just mentioning in case somebody uses this function where uniqueness should be guaranteed.
  • PascalVKooten
    PascalVKooten over 6 years
    @Darkonaut just wondering... compression results in a smaller size, so it would be quicker to read into memory. It could be that the extra processing due to compressing/decompressing is still faster than having to read more bytes. Or do you have a situation I'm not thinking of?
  • Darkonaut
    Darkonaut over 6 years
    @PascalvKooten That's an interesting remark, thanks! I have no idea how the tradeoff will play out depending on data size, but I surely will have to test and see (see the compression sketch at the end of this comment thread).
  • Sairus
    Sairus over 5 years
    Can Parquet or Feather be compared to HDF5 format, since they have similar functionality?
  • static_rtti
    static_rtti almost 5 years
    One year later, is the situation still the same?
  • ivo Welch
    ivo Welch over 4 years
    HDF5 is more general and heavyweight... and also a lot slower most of the time.
  • krisho
    krisho over 4 years
    Just to add an observation, 200,000 images in parquet format took 4 GB, but in feather took 6 GB. The data was read using pandas pd.read_parquet and pd.read_feather. pd.read_parquet took around 4 minutes, but pd.read_feather took 11 seconds. That is a huge difference. Reference: kaggle.com/corochann/…
  • HCSF
    HCSF about 4 years
    @WesMcKinney I noticed your answer was written back in 2018. After 2.3 years, do you still think Arrow (feather) is not good for long-term storage compared to Parquet? Is there a specific reason, like stability or format evolution?
  • anon01
    anon01 over 2 years
    W. McKinney indicates that feather (v2) is now stable here: stackoverflow.com/questions/64089691/…
  • Guillermo.D
    Guillermo.D almost 2 years
    Thanks for the answer, but the question is between feather and parquet, and you start your answer mentioning Arrow. This makes things even more confusing.
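
Following up on the compression tradeoff discussed in the comments above: a minimal sketch (assuming pyarrow >= 0.17, where Feather V2 gained optional compression; the data and file names are made up) comparing on-disk size and read time for uncompressed vs. compressed Feather:

    import os
    import time

    import numpy as np
    import pandas as pd
    import pyarrow.feather as feather

    n = 5_000_000
    df = pd.DataFrame({'value': np.random.randn(n),
                       'category': np.random.choice(['a', 'b', 'c'], size=n)})

    # 'uncompressed' remains an option, as noted above; 'lz4' and 'zstd'
    # trade CPU time for fewer bytes on disk.
    for compression in ('uncompressed', 'lz4', 'zstd'):
        path = f'example_{compression}.feather'
        feather.write_feather(df, path, compression=compression)
        t0 = time.perf_counter()
        feather.read_feather(path)
        dt = time.perf_counter() - t0
        size = os.path.getsize(path)
        print(f'{compression:>12}: {size:>12,d} bytes, read in {dt:.3f} s')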