How can I write a parquet file using Spark (pyspark)?
Solution 1
The error was due to the fact that the textFile method from SparkContext returned an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood. So I needed to use the DataFrameReader to read the CSV file correctly before converting it to a parquet file.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
# read csv
df = spark.read.csv("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.show()
df.write.parquet("output/proto.parquet")
Solution 2
You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood.

Here's the Koalas code:
import databricks.koalas as ks
df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
ultraInstinct
Updated on July 09, 2022

Comments
-
ultraInstinct almost 2 years
I'm pretty new to Spark and I've been trying to convert a DataFrame to a parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'
from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")
# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")
# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")
Do you know how to make this work?
The spark version that I'm using is Spark 2.0.1 built for Hadoop 2.7.3.
-
eliasah about 7 years Even if your code is correct, your explanation isn't. SparkContext doesn't convert the CSV file to an RDD. The textFile method from SparkContext returns an RDD, and what you need is a DataFrame, and thus a SQLContext or a HiveContext, which is also encapsulated in a SparkSession in Spark 2+. Would you care to correct that information and accept the answer to close the question? -
ultraInstinct about 7 yearsThanks @eliasah for your feedback!
-
mnis.p almost 6 years The answer is for a DataFrame. How can I write an RDD in parquet format?
-
Sowmya almost 4 years Hi @Powers, I tried installing it with
sc.install_pypi_package("koalas") # Install latest koalas version
while I was working on AWS EMR. However, when I tried importing it, it said No module named 'koalas'
-
Powers almost 4 years @Sowmya - This link explains how to install pypi packages in an EMR environment: docs.aws.amazon.com/emr/latest/ReleaseGuide/…. Hope that helps!
-
Sowmya almost 4 years Thanks. That's indeed nice of you to reply to my comment. I knew that link; instead of doing a system installation, I was thinking more of a local or notebook-specific installation. Well, if the local one doesn't work out, I'll go for the system installation.
-
Haha over 3 years df.write.parquet takes the output folder as its argument, not a path to a single file.
-
nam over 2 years @eliasah Does your comment mean that for Spark 2+ we need only the following two lines to convert CSV to Parquet:
df = spark.read.parquet("/path/to/infile.csv")
df.write.csv("/path/to/outfile.parquet")
Did I get it right?