How can I write a parquet file using Spark (pyspark)?

Solution 1

The error was due to the fact that the textFile method from SparkContext returns an RDD, while what I needed was a DataFrame.

SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a Parquet file.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
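
By default spark.read.csv produces string columns named _c0, _c1, and so on. If the file has a header row, the reader's standard header and inferSchema options can pick up column names and types; a minimal variation of the above, assuming such a header exists:

# assumes the CSV has a header row; inferSchema asks Spark to guess column types
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)
df.write.parquet("output/proto.parquet")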

Solution 2

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood.

Here's the Koalas code:

import databricks.koalas as ks

df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
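
Koalas has since been folded into Spark itself as the pandas API on Spark (pyspark.pandas, available from Spark 3.2 onwards). On those versions the equivalent sketch, assuming the same file paths, is:

import pyspark.pandas as ps

df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')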
Author: ultraInstinct

Updated on July 09, 2022

Comments

  • ultraInstinct
    ultraInstinct almost 2 years

    I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

    from pyspark import SparkContext
    sc = SparkContext("local", "Protob Conversion to Parquet ")
    
    # spark is an existing SparkSession
    df = sc.textFile("/temp/proto_temp.csv")
    
    # Displays the content of the DataFrame to stdout
    df.write.parquet("/output/proto.parquet")
    

    Do you know how to make this work?

    The Spark version I'm using is 2.0.1, built for Hadoop 2.7.3.

  • eliasah
    eliasah about 7 years
    Even if your code is correct, your explanation isn't. SparkContext doesn't convert the CSV file to an RDD. The textFile method from SparkContext returns an RDD, and what you need is a DataFrame, hence a SQLContext or a HiveContext, which is also encapsulated in a SparkSession in Spark 2+. Would you mind correcting that information and accepting the answer to close the question?
  • ultraInstinct
    ultraInstinct about 7 years
    Thanks @eliasah for your feedback!
  • mnis.p
    mnis.p almost 6 years
    The answer is for a DataFrame. How can I write an RDD in Parquet format? (See the sketch after these comments.)
  • Sowmya
    Sowmya almost 4 years
    Hi @Powers, I tried installing it with sc.install_pypi_package("koalas") # Install latest koalas version while I was working on AWS EMR. However, when I tried importing it, it said No module named 'koalas'.
  • Powers
    Powers almost 4 years
    @Sowmya - This link explains how to install pypi packages in an EMR environment: docs.aws.amazon.com/emr/latest/ReleaseGuide/…. Hope that helps!
  • Sowmya
    Sowmya almost 4 years
    Thanks. That's indeed nice of you to reply to my comment. I knew that link; rather than doing a system installation, I was thinking more of a local or notebook-specific installation. Well, if the local one doesn't work out, I'll go for the system installation.
  • Haha
    Haha over 3 years
    df.write.parquet takes a folder as its argument, not the path to a single file.
  • nam
    nam over 2 years
    @eliasah Does your comment mean that for Spark 2+ we only need the following two lines to convert CSV to Parquet: df = spark.read.parquet("/path/to/infile.csv") and df.write.csv("/path/to/outfile.parquet")? Did I get it right?
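
To address the RDD question raised in the comments above: an RDD has no write attribute, so it has to be converted to a DataFrame first. This is not from the original answer, just a minimal sketch assuming the RDD holds comma-separated text lines; the column names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD to Parquet").getOrCreate()

# hypothetical RDD of comma-separated text lines
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")
rows = rdd.map(lambda line: line.split(","))

# column names here are placeholders; adjust to match the real data
df = spark.createDataFrame(rows, ["col1", "col2", "col3"])
df.write.parquet("output/proto_from_rdd.parquet")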