Avro vs. Parquet


Solution 1

If you haven't already decided, I'd go ahead and write Avro schemas for your data. Once that's done, choosing between Avro container files and Parquet files is about as simple as swapping out e.g.,

job.setOutputFormatClass(AvroKeyOutputFormat.class);
AvroJob.setOutputKeySchema(MyAvroType.getClassSchema());

for

job.setOutputFormatClass(AvroParquetOutputFormat.class);
AvroParquetOutputFormat.setSchema(job, MyAvroType.getClassSchema());

The Parquet format does seem to be a bit more computationally intensive on the write side, e.g., requiring RAM for buffering and CPU for ordering the data, but it should reduce I/O, storage, and transfer costs, and it makes for efficient reads, especially with SQL-like queries (e.g., Hive or SparkSQL) that only address a portion of the columns.
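
On the read side, the matching swap is AvroParquetInputFormat (also from parquet-avro), and you can request a projection so Parquet skips the I/O for every column you don't list. A minimal sketch, assuming MyAvroType happens to have "id" and "eventTime" fields (the field names are made up; Schema/SchemaBuilder are from org.apache.avro):

Schema projection = SchemaBuilder.record("MyAvroType").fields()
    .requiredString("id")       // hypothetical field
    .requiredLong("eventTime")  // hypothetical field
    .endRecord();
job.setInputFormatClass(AvroParquetInputFormat.class);
AvroParquetInputFormat.setRequestedProjection(job, projection);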

In one project, I ended up reverting from Parquet to Avro containers because the schema was too extensive and nested (being derived from some fairly hierarchical object-oriented classes) and resulted in thousands of Parquet columns. In turn, our row groups were really wide and shallow, which meant that it took forever before we could process a small number of rows in the last column of each group.

I haven't had much chance to use Parquet for more normalized/sane data yet, but I understand that, used well, it allows for significant performance improvements.

Solution 2

Avro is a row-based format. If you want to retrieve the data as a whole, you can use Avro.

Parquet is a column-based format. If your data consists of many columns but you are interested in only a subset of them, you can use Parquet.

HBase is useful when frequent updates to the data are involved. Avro is fast for retrieval; Parquet is much faster when only some of the columns are queried.
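
As a minimal sketch of that whole-record, row-oriented access pattern with Avro (the file name is hypothetical; the schema embedded in the container file is picked up automatically):

import java.io.File;
import java.io.IOException;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroFullScan {
    public static void main(String[] args) throws IOException {
        // Full scan: every field of every record is deserialized.
        try (DataFileReader<GenericRecord> reader = new DataFileReader<>(
                new File("events.avro"), new GenericDatumReader<GenericRecord>())) {
            for (GenericRecord record : reader) {
                System.out.println(record);
            }
        }
    }
}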

Solution 3

Avro

  • Widely used as a serialization platform
  • Row-based, offers a compact and fast binary format
  • The schema is encoded in the file, so the data can be stored untagged
  • Files support block compression and are splittable (see the write sketch after this list)
  • Supports schema evolution
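
A minimal write sketch of the schema-in-file and block-compression points above (the User schema, its fields and the output path are made up for illustration):

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroContainerWrite {
    public static void main(String[] args) throws IOException {
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"}]}");

        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.deflateCodec(6)); // block compression
            writer.create(schema, new File("users.avro")); // schema is embedded in the file header

            GenericRecord user = new GenericData.Record(schema);
            user.put("id", 1L);
            user.put("name", "Ada");
            writer.append(user);
        }
    }
}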

Parquet

  • Column-oriented binary file format
  • Uses the record shredding and assembly algorithm described in the Dremel paper
  • Each data file contains the values for a set of rows (organized into row groups; see the tuning sketch after this list)
  • Efficient in terms of disk I/O when specific columns need to be queried

From Choosing an HDFS data storage format- Avro vs. Parquet and more
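
Row-group size and compression are tunable at write time. A small sketch, extending the MapReduce job setup from Solution 1 (ParquetOutputFormat and CompressionCodecName come from parquet-hadoop; the values shown are illustrative, not recommendations):

// The row-group ("block") size controls how many rows are buffered in memory
// before a group of column chunks is flushed to the file.
ParquetOutputFormat.setBlockSize(job, 128 * 1024 * 1024); // 128 MB row groups
ParquetOutputFormat.setCompression(job, CompressionCodecName.SNAPPY);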

Solution 4

Both Avro and Parquet are "self-describing" storage formats, meaning that both embed the schema and metadata alongside the data when storing it in a file. Which format to use depends on the use case. Three aspects constitute the basis upon which you may choose the format that is optimal in your case:

  1. Read/write operations: Parquet is a column-based file format and supports indexing. Because of that, it is suitable for write-once, read-intensive workloads with complex or analytical, low-latency queries. This is generally used by end users/data scientists.
    Meanwhile Avro, being a row-based file format, is best used for write-intensive operations. This is generally used by data engineers. Both support serialization and compression formats, although they do so in different ways.

  2. Tools: Parquet is a good fit for Impala. (Impala is a massively parallel processing (MPP) SQL query engine which knows how to operate on data that resides in one or a few external storage engines.) Again, Parquet lends itself well to complex/interactive querying and fast (low-latency) outputs over data in HDFS. This is supported by CDH (Cloudera's Distribution including Apache Hadoop). Hadoop also supports Apache's Optimized Row Columnar (ORC) format (the selection depends on the Hadoop distribution), whereas Avro is best suited to Spark processing.

  3. Schema evolution: Evolving a DB schema means changing the DB's structure, and therefore its data and its query processing.
    Both Parquet and Avro support schema evolution, but to varying degrees.
    Parquet is good for 'append' operations, e.g. adding columns, but not for renaming columns unless the 'read' is done by index.
    Avro is better suited than Parquet for appending, deleting and generally mutating columns. Historically Avro has provided a richer set of schema-evolution possibilities than Parquet, and although their capabilities tend to blur, Avro still shines in that area compared to Parquet (see the minimal Avro example after this list).
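
A minimal sketch of the Avro side of point 3 (the User schema and the added "email" field are made up): a file written with the old two-field schema can still be read with the newer schema, as long as the added field carries a default.

import java.io.File;
import java.io.IOException;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;

public class AvroEvolvedRead {
    public static void main(String[] args) throws IOException {
        // Newer reader schema: "email" was added with a default value.
        Schema readerSchema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
          + "{\"name\":\"id\",\"type\":\"long\"},"
          + "{\"name\":\"name\",\"type\":\"string\"},"
          + "{\"name\":\"email\",\"type\":\"string\",\"default\":\"unknown\"}]}");

        // The writer schema is read from the file header and resolved against
        // readerSchema; records written before the change get the default.
        GenericDatumReader<GenericRecord> datumReader =
            new GenericDatumReader<>(readerSchema);
        try (DataFileReader<GenericRecord> reader =
                 new DataFileReader<>(new File("users.avro"), datumReader)) {
            for (GenericRecord record : reader) {
                System.out.println(record.get("email"));
            }
        }
    }
}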

Solution 5

Your understanding is right. In fact, we ran into a similar situation during a data migration in our DWH. We chose Parquet over Avro as the disk savings we got were almost double what we got with Avro. The query processing time was also much better than with Avro. But yes, our queries were based on aggregations, column-based operations, etc., hence Parquet was predictably a clear winner.

We are using Hive 0.12 from the CDH distro. You mentioned you are running into issues with Hive + Parquet; what are those? We did not encounter any.


Comments

  • Abhishek
    Abhishek almost 2 years

    I'm planning to use one of the Hadoop file formats for my Hadoop-related project. I understand Parquet is efficient for column-based queries and Avro for full scans or when we need all the columns' data!

    Before I proceed and choose one of the file formats, I want to understand what the disadvantages/drawbacks of one over the other are. Can anyone explain it to me in simple terms?

  • Abhishek
    Abhishek about 9 years
    Waiting for the comparison. Currently I chose Avro for my project as Parquet has compatibility issues with Hive :)
  • Tagar
    Tagar about 8 years
    Parquet supports nested datasets/collections too.
  • steamer25
    steamer25 about 8 years
    @Ruslan: Yes, it did technically support the nested structures. The problem was the very high number of columns due to extensive de-normalization of the data. It worked but it was very slow.
  • Tagar
    Tagar about 8 years
    Yes, writing data in Parquet is more expensive. Reads are the other way around, especially if your queries normally read a subset of columns.
  • Rockie Yang
    Rockie Yang almost 8 years
    I think Parquet is suitable for most use cases, except when data in the same column varies a lot and almost all columns are always analysed.
  • E B
    E B almost 7 years
    @Abshinek, can you provide some info on the compatibility issues with Hive and Avro?
  • OneCricketeer
    OneCricketeer over 6 years
    @EB There should not be any issues, if there are, they would be mentioned at cwiki.apache.org/confluence/display/Hive/AvroSerDe
  • ᐅdevrimbaris
    ᐅdevrimbaris about 5 years
    "Tools" part is a bit misleading. Parquet is efficiently used by lots of other frameworks like Spark, Presto, Hive etc. Avro is not specific to Spark, it is widely used as a HDFS storage format and message passing scenarios like in Kafka.
  • Cbhihe
    Cbhihe almost 5 years
    Aakash Aggarwal: Can you explain what you mean in paragraph 2 with "Avro is best fit for Spark processing"? As mentioned by devrimbaris, Parquet is very well integrated in the Spark processing environment as well. o_O ?!?
  • josiah
    josiah almost 4 years
    Apache Arrow also does not yet support mixed nesting (lists with dictionaries or dictionaries with lists). So if you want to work with complex nesting in Parquet, you're stuck with Spark, Hive, and other such tools that don't rely on Arrow for reading and writing Parquet.
  • shadow0359
    shadow0359 almost 3 years
    Parquet stores data on disk in a hybrid manner. It partitions the data horizontally and stores each partition in a columnar way.