Spark SQL - difference between gzip vs snappy vs lzo compression formats

Solution 1

Just try them on your data.

lzo and snappy are fast compressors and very fast decompressors, but they compress less than gzip, which compresses better but is somewhat slower.
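
A quick way to try the codecs on your own data is to write the same table with each of them and compare time and size. This is just a sketch with pyarrow: 'your_data.csv' is a placeholder, and pyarrow does not ship an LZO codec, so lz4 stands in as the comparable fast codec here.

import os
import time
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv('your_data.csv')          # placeholder: point this at your own data
table = pa.Table.from_pandas(df)

for codec in ('snappy', 'gzip', 'lz4'):    # pyarrow has no LZO codec; lz4 is a similar fast codec
    path = f'test.parquet.{codec}'
    start = time.perf_counter()
    pq.write_table(table, path, compression=codec)
    elapsed = time.perf_counter() - start
    print(codec, f'{elapsed:.2f}s', os.stat(path).st_size, 'bytes')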

Solution 2

Compression Ratio: GZIP compression uses more CPU resources than Snappy or LZO, but provides a higher compression ratio.

General Usage: GZip is often a good choice for cold data, which is accessed infrequently. Snappy or LZO are a better choice for hot data, which is accessed frequently.

Snappy often performs better than LZO. It is worth running tests to see if you detect a significant difference.

Splittability: If you need your compressed data to be splittable, BZip2, LZO, and Snappy formats are splittable, but GZip is not.

GZIP compresses data about 30% more than Snappy, but reading GZIP data takes roughly 2x more CPU than reading Snappy data.

LZO focuses on decompression speed at low CPU usage; its higher compression levels come at the cost of more CPU.

For longer-term/static storage, GZip compression is still better.

See extensive research and benchmark code and results in this article (Performance of various general compression algorithms – some of them are unbelievably fast!).
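
As an illustration of the hot/cold guideline above, the codec can be chosen per write in Spark. A minimal PySpark sketch, where the input and output paths are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-per-write").getOrCreate()

# Placeholder input; replace with your own source.
events_df = spark.read.json("s3://bucket/raw/events")

# Hot data, queried frequently: snappy keeps decompression cheap.
events_df.write.option("compression", "snappy").parquet("s3://bucket/hot/events")

# Cold data, rarely read: gzip trades CPU for a better compression ratio.
events_df.write.option("compression", "gzip").parquet("s3://bucket/cold/events")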

Solution 3

Use Snappy if you can handle higher disk usage for the performance benefits (lower CPU + Splittable).

When Spark switched from GZIP to Snappy by default, this was the reasoning:

Based on our tests, gzip decompression is very slow (< 100MB/s), making queries decompression bound. Snappy can decompress at ~ 500MB/s on a single core.

Snappy:

  • Storage Space: High
  • CPU Usage: Low
  • Splittable: Yes (1)

GZIP:

  • Storage Space: Medium
  • CPU Usage: Medium
  • Splittable: No

1) http://boristyukin.com/is-snappy-compressed-parquet-file-splittable/
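
If you want the codec applied session-wide rather than per write, Spark SQL exposes it as a configuration key. A minimal PySpark sketch (the output paths are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-config").getOrCreate()

# Session-wide default codec for Parquet output written by Spark SQL.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

df = spark.range(1000)   # toy DataFrame just for illustration

df.write.mode("overwrite").parquet("/tmp/out_snappy")                               # uses the session default
df.write.mode("overwrite").option("compression", "gzip").parquet("/tmp/out_gzip")   # per-write override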

Solution 4

Based on the data below, I'd say gzip wins outside of scenarios like streaming, where write-time latency would be important.

It's important to keep in mind that speed is essentially compute cost. However, cloud compute is a one-time cost whereas cloud storage is a recurring cost. The tradeoff depends on the retention period of the data.
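
A back-of-envelope way to weigh that tradeoff, where every number is an assumed placeholder rather than a measurement:

# Sketch: one-time compute cost of gzip vs recurring storage savings.
snappy_tb = 1.00               # assumed dataset size with snappy, TB
gzip_tb = 0.50                 # assumed dataset size with gzip, TB (~2x smaller)
storage_per_tb_month = 23.0    # assumed storage price, $/TB/month
gzip_extra_compute = 5.0       # assumed one-time extra compute cost to write gzip, $

for months in (1, 6, 24):
    saving = (snappy_tb - gzip_tb) * storage_per_tb_month * months
    print(f"{months:>2} months retention: gzip saves ${saving:.2f} in storage "
          f"vs ${gzip_extra_compute:.2f} extra one-time compute")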


Let's test speed and size with large and small parquet files in Python.

Results (large file, 117 MB):

        +----------+----------+--------------------------+
        | snappy   | gzip     | (gzip-snappy)/snappy*100 |
+-------+----------+----------+--------------------------+
| write | 1.62 s   | 7.65 s   | 372% slower              |
+-------+----------+----------+--------------------------+
| size  | 35484122 | 17269656 |  51% smaller             |
+-------+----------+----------+--------------------------+
| read  | 973 ms   | 1140 ms  |  17% slower              |
+-------+----------+----------+--------------------------+

Results (small file, 4 KB, Iris dataset):

        +---------+---------+--------------------------+
        | snappy  | gzip    | (gzip-snappy)/snappy*100 |
+-------+---------+---------+--------------------------+
| write | 1.56 ms | 2.09 ms | 33.9% slower             |
+-------+---------+---------+--------------------------+
| size  | 6990    | 6647    |  4.9% smaller            |
+-------+---------+---------+--------------------------+
| read  | 3.22 ms | 3.44 ms |  6.8% slower             |
+-------+---------+---------+--------------------------+

small_file.ipynb

import os, sys
import pyarrow
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

df = pd.DataFrame(
    data= np.c_[iris['data'], iris['target']],
    columns= iris['feature_names'] + ['target']
)

# ========= WRITE =========
%timeit df.to_parquet(path='iris.parquet.snappy', compression='snappy', engine='pyarrow', index=True)
# 1.56 ms

%timeit df.to_parquet(path='iris.parquet.gzip', compression='gzip', engine='pyarrow', index=True)
# 2.09 ms

# ========= SIZE =========
os.stat('iris.parquet.snappy').st_size
# 6990

os.stat('iris.parquet.gzip').st_size
# 6647

# ========= READ =========
%timeit pd.read_parquet(path='iris.parquet.snappy', engine='pyarrow')
# 3.22 ms

%timeit pd.read_parquet(path='iris.parquet.gzip', engine='pyarrow')
# 3.44 ms

large_file.ipynb

import os, sys
import pyarrow
import pandas as pd

df = pd.read_csv('file.csv')

# ========= WRITE =========
%timeit df.to_parquet(path='file.parquet.snappy', compression='snappy', engine='pyarrow', index=True)
# 1.62 s

%timeit df.to_parquet(path='file.parquet.gzip', compression='gzip', engine='pyarrow', index=True)
# 7.65 s

# ========= SIZE =========
os.stat('file.parquet.snappy').st_size
# 35484122

os.stat('file.parquet.gzip').st_size
# 17269656

# ========= READ =========
%timeit pd.read_parquet(path='file.parquet.snappy', engine='pyarrow')
# 973 ms

%timeit pd.read_parquet(path='file.parquet.gzip', engine='pyarrow')
# 1.14 s
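
Note that the %timeit lines only work inside IPython/Jupyter. Outside a notebook, roughly the same measurement can be sketched with the standard timeit module (same file names as in the notebooks above):

import timeit
import pandas as pd

df = pd.read_csv('file.csv')   # same input as large_file.ipynb

write_snappy = timeit.timeit(
    lambda: df.to_parquet('file.parquet.snappy', compression='snappy', engine='pyarrow'),
    number=3) / 3
write_gzip = timeit.timeit(
    lambda: df.to_parquet('file.parquet.gzip', compression='gzip', engine='pyarrow'),
    number=3) / 3
print(f"snappy write: {write_snappy:.2f}s, gzip write: {write_gzip:.2f}s")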

Solution 5

I agree with the first answer (@Mark Adler) and have some research info [1], but I do not agree with the second answer (@Garren S) [2]. Maybe Garren misunderstood the question, because:

[2] Parquet is splittable with all supported codecs: Is gzipped Parquet file splittable in HDFS for Spark?, Tom White's Hadoop: The Definitive Guide, 4th edition, Chapter 5: Hadoop I/O, page 106.

[1] My research: source data - 205 GB of text (separated fields), not compressed. Output data:

+------------------------+--------------------------+------------+
|                        | time of computing, hours | volume, GB |
+------------------------+--------------------------+------------+
| ORC with default codec | 3-3.5                    | 12.3       |
+------------------------+--------------------------+------------+
| Parquet with GZIP      | 3.5-3.7                  | 12.9       |
+------------------------+--------------------------+------------+
| Parquet with SNAPPY    | 2.5-3.0                  | 60.4       |
+------------------------+--------------------------+------------+

The transformation was performed using Hive on an EMR cluster of 2 m4.16xlarge instances. The transformation selected all fields with ordering by several fields. This research is of course not standard, but it at least gives a real comparison; results may differ with other datasets and computations.
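
For reference, the same kind of transformation can be sketched in PySpark rather than Hive; the paths and column names below are placeholders, not the actual 205 GB dataset:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("codec-comparison").getOrCreate()

# Placeholder source: a delimited, uncompressed text dataset.
src = spark.read.option("delimiter", "\t").csv("s3://bucket/source/", header=True)

# "Select all fields with ordering by several fields" (column names are placeholders).
ordered = src.orderBy("field_a", "field_b", "field_c")

ordered.write.orc("s3://bucket/out/orc_default/")                                 # ORC with default codec
ordered.write.option("compression", "gzip").parquet("s3://bucket/out/pq_gzip/")
ordered.write.option("compression", "snappy").parquet("s3://bucket/out/pq_snappy/")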

Comments

  • Shankar, over 3 years ago:

    I am trying to use Spark SQL to write parquet file.

    By default Spark SQL supports gzip, but it also supports other compression formats like snappy and lzo.

    What is the difference between these compression formats?