Parquet vs ORC vs ORC with Snappy


Solution 1

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file structure is flat.
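
As a rough illustration of that tree layout, here is a minimal PySpark sketch (the column names and the output path are made up): a nested struct is written to Parquet, where each leaf field becomes its own column in a Dremel-style tree.

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.getOrCreate()

    # One record with a single level of nesting (hypothetical fields)
    df = spark.createDataFrame([Row(id=1, address=Row(city="Berlin", zip="10115"))])
    df.printSchema()  # address shows up as a struct with city and zip fields

    # Parquet stores address.city and address.zip as separate column chunks,
    # so deeper nesting simply means a deeper tree.
    df.write.mode("overwrite").parquet("/tmp/nested_demo_parquet")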

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help to get better query response time, especially when it comes to sum operations.
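
As a sketch of how those ORC bloom filters are switched on (the column name user_id and the output path are placeholders; in plain Hive the same keys would go into TBLPROPERTIES when creating the table):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000000).withColumnRenamed("id", "user_id")

    # Write ORC with a bloom filter on user_id; in Hive DDL this corresponds to
    # TBLPROPERTIES ("orc.bloom.filter.columns"="user_id", "orc.bloom.filter.fpp"="0.05")
    (df.write
       .format("orc")
       .option("orc.bloom.filter.columns", "user_id")  # columns to build bloom filters for
       .option("orc.bloom.filter.fpp", "0.05")         # target false-positive rate
       .mode("overwrite")
       .save("/tmp/orc_bloom_demo"))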

The Parquet default compression is SNAPPY. Do Tables A, B, C, and D hold the same dataset? If so, something looks off when it only compresses down to 1.9 GB.
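
If compression is the suspect, it may be worth double-checking how Table D was written; a small sketch with placeholder names, assuming the table is rewritten through Spark:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Session-wide default codec for Parquet output (snappy is Spark's default)
    spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

    # Or set it per write; gzip usually gives smaller files at a higher CPU cost
    df = spark.table("table_d")  # placeholder table name
    df.write.option("compression", "gzip").mode("overwrite").parquet("/tmp/table_d_gzip")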

Solution 2

You are seeing this because:

  • Hive has a vectorized ORC reader but no vectorized parquet reader.

  • Spark has a vectorized parquet reader and no vectorized ORC reader.

  • Spark performs best with Parquet; Hive performs best with ORC.

I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.

(correct as of Hive 2.0 and Spark 2.1)
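
If you want to check or flip these readers yourself, the relevant switches look roughly like this on the Spark side (the ORC flags only apply once Spark has a native vectorized ORC reader, i.e. 2.3+; in Hive the analogous knob is hive.vectorized.execution.enabled):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Vectorized Parquet reader (enabled by default in Spark 2.x)
    spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

    # Vectorized ORC reading requires the native ORC implementation and
    # Spark 2.3+; on older versions these settings have no effect.
    spark.conf.set("spark.sql.orc.impl", "native")
    spark.conf.set("spark.sql.orc.enableVectorizedReader", "true")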

Solution 3

Both Parquet and ORC have their own advantages and disadvantages. But I simply try to follow a simple rule of thumb: "How nested is your data, and how many columns are there?" If you follow the Google Dremel paper, you can see how Parquet is designed: it uses a hierarchical, tree-like structure to store data. The deeper the nesting, the deeper the tree.

ORC, in contrast, is designed for a flattened file store. So if your data is flat with fewer columns, you can go with ORC; otherwise, Parquet would be fine for you. Compression on flat data works amazingly well in ORC.

We did some benchmarking with a larger flattened file, converted it to a Spark DataFrame, stored it in both Parquet and ORC format on S3, and queried it with Redshift Spectrum (a sketch of the write step follows the results below).

Size of the file in Parquet: ~7.5 GB and took 7 minutes to write
Size of the file in ORC: ~7.1 GB and took 6 minutes to write
Query seems faster in ORC files.

Soon we will do some benchmarking with nested data and update the results here.
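
A rough sketch of the write step from that benchmark (the bucket name, input path, and schema handling are placeholders; the Redshift Spectrum external tables were then defined over each prefix):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Placeholder input: in the real benchmark this was a large, flat dataset
    df = spark.read.csv("s3a://my-bucket/flat_input/", header=True, inferSchema=True)

    # The same DataFrame written once per format for the comparison
    df.write.mode("overwrite").parquet("s3a://my-bucket/bench/parquet/")
    df.write.mode("overwrite").orc("s3a://my-bucket/bench/orc/")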

Solution 4

We did some benchmarks comparing the different file formats (Avro, JSON, ORC, and Parquet) in different use cases.

https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

The data is all publicly available and benchmark code is all open source at:

https://github.com/apache/orc/tree/branch-1.4/java/bench

Solution 5

Both of them have their advantages. We use Parquet at work together with Hive and Impala, but I just wanted to point out a few advantages of ORC over Parquet: during long-running queries, when Hive queries ORC tables, GC is called about 10 times less frequently. It might be nothing for many projects, but might be crucial for others.

ORC also takes much less time when you need to select just a few columns from the table. Some other queries, especially those with joins, also take less time because of vectorized query execution, which is not available for Parquet.
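
To see that column-selection effect in practice, a quick sketch with a placeholder path and column names: when only a couple of columns are selected, the reader can skip the remaining column streams entirely, which is where the speedup comes from.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    wide = spark.read.orc("/tmp/wide_table_orc")  # placeholder path to a wide ORC table

    # Only the two selected column streams are read; the pruned columns show
    # up as ReadSchema in the physical plan.
    wide.select("user_id", "event_time").explain()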

Also, ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many numeric columns, it doesn't compress as well; this affects both zlib and snappy compression.

Author: Rahul

Updated on January 24, 2020

Comments

  • Rahul
    Rahul over 4 years

    I am running a few tests on the storage formats available with Hive and using Parquet and ORC as major options. I included ORC once with default compression and once with Snappy.

    I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what those documents claim.

    Here are some details of my data.

    Table A - Text File Format - 2.5 GB

    Table B - ORC - 652 MB

    Table C - ORC with Snappy - 802 MB

    Table D - Parquet - 1.9 GB
    

    Parquet was the worst as far as compression for my table is concerned.

    My tests with the above tables yielded the following results.

    Row count operation

    Text Format Cumulative CPU - 123.33 sec
    
    Parquet Format Cumulative CPU - 204.92 sec
    
    ORC Format Cumulative CPU - 119.99 sec 
    
    ORC with SNAPPY Cumulative CPU - 107.05 sec
    

    Sum of a column operation

    Text Format Cumulative CPU - 127.85 sec   
    
    Parquet Format Cumulative CPU - 255.2 sec   
    
    ORC Format Cumulative CPU - 120.48 sec   
    
    ORC with SNAPPY Cumulative CPU - 98.27 sec
    

    Average of a column operation

    Text Format Cumulative CPU - 128.79 sec
    
    Parquet Format Cumulative CPU - 211.73 sec    
    
    ORC Format Cumulative CPU - 165.5 sec   
    
    ORC with SNAPPY Cumulative CPU - 135.45 sec 
    

    Selecting 4 columns from a given range using where clause

    Text Format Cumulative CPU -  72.48 sec 
    
    Parquet Format Cumulative CPU - 136.4 sec       
    
    ORC Format Cumulative CPU - 96.63 sec 
    
    ORC with SNAPPY Cumulative CPU - 82.05 sec 
    

    Does that mean ORC is faster than Parquet? Or is there something I can do to make it work better in terms of query response time and compression ratio?

    Thanks!

  • Rahul
    Rahul over 8 years
    Table A - Text File Format - No Compression......... Table B - ORC file format with ZLIB compression......... Table C - ORC with Snappy....... Table D - Parquet with Snappy..... I worked on another table with ~150 columns and ~160 GB in size to check how the file formats perform there. Parquet took 35 GB to store that 160 GB of data, while ORC with Snappy took 39 GB...... The compression looked way better for Parquet compared to the test posted in the question, but performance was again on similar lines. Plain ORC shined here with even better performance than the ORC+SNAPPY combination.
  • Rahul
    Rahul over 8 years
    The data structure for my use cases was flat, without any nesting. I agree with your indexing comment on Parquet vs ORC, and that does make a difference actually. Do you have any results to share from the performance comparison of both? That might help calm my conscience that I am implementing the formats correctly. :)
  • PhanThomas
    PhanThomas over 8 years
    I never tested my dataset on Parquet because the index was a necessary requirement, and we also have a flat data structure with no nested information. What I figured out is that, depending on where you store your files, you need a different stripe and file size to get the best results. When you store your files permanently on HDFS, it is better to have larger files and stripes. "set mapred.max.split.size=4096000000" was the parameter I used to influence the file size, and I left the stripe size at its default value. With this setting it gave me about a 94% query and compression boost.
  • PhanThomas
    PhanThomas over 8 years
    If you want to store your files on Amazon S3 as cold storage, a much smaller file and stripe size gave me far better results. I used files of 40-60 MB containing a single stripe.
  • Steen
    Steen about 6 years
    As of 2.3.0, Spark does have a vectorized ORC reader: issues.apache.org/jira/browse/SPARK-16060
  • Daniel Kats
    Daniel Kats about 6 years
    This is really useful, but there should be a disclaimer that @Owen works for Hortonworks, which originally developed the ORC file format
  • ruhong
    ruhong almost 6 years
    Hive 2.3.0 has vectorized Parquet reader - issues.apache.org/jira/browse/HIVE-14815
  • Danilo Gomes
    Danilo Gomes over 5 years
    Thanks! But the second link is broken. Can you please fix or remove it from your answer?
  • Anurag Sharma
    Anurag Sharma about 5 years
    Since Spark 2.3, Spark supports a vectorized ORC reader spark.apache.org/docs/latest/sql-data-sources-orc.html