Loading a gzip-compressed CSV file in Spark 2.0


Solution 1

I just discovered that the following works with gzipped csv files:

spark.read.option("header", "true").csv("myfile.csv")
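
For the gzip codec to be picked up, the path you pass should end in .gz (see Solution 3 below). A minimal sketch, assuming a hypothetical gzip-compressed file named myfile.csv.gz:

# Read a gzip-compressed CSV straight into a DataFrame.
# The .gz suffix on the (hypothetical) path is what triggers decompression.
df = spark.read.option("header", "true").csv("myfile.csv.gz")
df.show()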

Solution 2

You can use spark.sparkContext.textFile("file.gz")

The file extension should be .gz
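
Note that textFile returns an RDD of lines rather than a DataFrame, so you still need to parse the rows yourself. A minimal sketch, assuming the gzipped file holds comma-separated rows and using made-up column names:

# textFile decompresses automatically based on the .gz suffix and yields one string per line
rdd = spark.sparkContext.textFile("file.gz")
# split each line into fields, then convert to a DataFrame with hypothetical column names
df = rdd.map(lambda line: line.split(",")).toDF(["col1", "col2", "col3"])
df.show()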

Solution 3

I am unsure whether this changed between when the other answers here were written and when I came to this question, but I would like to record my findings for future reference, for myself and for others who run into the same issue. I was loading GZIP-compressed CSV files into a PySpark DataFrame on Spark version 2.4.7 and Python version 3.7.4 inside Google's managed Spark-as-a-service offering, aka "Dataproc". The underlying Dataproc image version is 1.5-debian10 if you want to investigate the specs further.

My problem was that I could not read the CSV without the output coming back garbled. One small tweak fixed it: renaming the file so that its suffix is .gz, after which things worked perfectly. Here is the code to reproduce the issue.

# Shell script to create a dummy gzip-compressed file under two different names
echo 'foo,bar,baz' > test.csv
gzip test.csv                  # produces test.csv.gz
cp test.csv.gz test_csv        # same gzip content, but without the .gz ending
# So now there are 2 copies of the same file with 2 different endings

I can then run a PySpark job, or even an interactive pyspark session (shown below), to verify that Spark does not intelligently detect the file type from its contents; rather, it looks at the filename and interprets the file type based on its name.

$ pyspark
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.4.7
      /_/

Using Python version 3.7.4 (default, Aug 13 2019 20:35:49)
SparkSession available as 'spark'.
>>> filename_noend = 'test_csv'
>>> filename_end = 'test.csv.gz'
>>> schema = 'field1 string,field2 string,field3 string'
>>> df_noend = spark.read.csv(path=filename_noend, schema=schema, header=False)
>>> df_noend.show()
+--------------------+-------------+------+
|              field1|       field2|field3|
+--------------------+-------------+------+
���`test.cs...|�*.�+T+
                      |  null|
+--------------------+-------------+------+

>>> df_end = spark.read.csv(path=filename_end, schema=schema, header=False)
>>> df_end.show()
+------+------+------+
|field1|field2|field3|
+------+------+------+
|   foo|   bar|   baz|
+------+------+------+
>>> exit()

Sadly, there is no read option like compression='gzip' to force decompression. So save your gzip-compressed files with a .gz ending and you are good to go!
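
If you are handed files without the .gz suffix, one workaround is simply to copy or rename them so the name ends in .gz before reading. A sketch using the hypothetical local paths from the example above:

import shutil

# give the mis-named local file a .gz ending so Spark picks the gzip codec
shutil.copy("test_csv", "test_csv.gz")
schema = 'field1 string,field2 string,field3 string'
df = spark.read.csv(path="test_csv.gz", schema=schema, header=False)
df.show()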


Comments

  • femibyte
    femibyte over 1 year

    How can I load a gzip-compressed CSV file in PySpark on Spark 2.0?

    I know that an uncompressed csv file can be loaded as follows:

    spark.read.format("csv").option("header", "true").load("myfile.csv")
    

    or

    spark.read.option("header", "true").csv("myfile.csv")
    
  • femibyte
    femibyte over 7 years
    This produces an RDD, not a DataFrame. Is there any way of reading into a DataFrame directly, instead of having to convert the RDD to a DataFrame?
  • femibyte
    femibyte over 7 years
    Actually never mind, the following works with gzipped csv files: spark.read.option("header", "true").csv("myfile.csv")
  • Kanav Sharma
    Kanav Sharma about 6 years
    Thanks for the responses. @Shankar, however, this option is only giving me the file names inside the gz file, not the contents of that file.
  • Kanav Sharma
    Kanav Sharma about 6 years
    Edit: I had to correct the extension to lowercase; it was in caps. Thanks.
  • Cesar A. Mostacero
    Cesar A. Mostacero over 4 years
    Have you tried this solution using multiple csv.gzip files? It would be really awesome if that works.
  • Tim496
    Tim496 about 4 years
    You can use the * wildcard: df = spark.read.option("header", "true").csv("some_path/*.gz"). It works across several folders too: df = spark.read.option("header", "true").csv("some_path/*/*.gz")