How can I write a parquet file using Spark (pyspark)?

Solution 1

The error was due to the fact that the textFile method from SparkContext returns an RDD, while what I needed was a DataFrame.

SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a Parquet file.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
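
By default spark.read.csv produces string columns named _c0, _c1, and so on. If the file has a header row, the reader's standard header and inferSchema options can pick up column names and types; a minimal variation of the above, assuming such a header exists:

# assumes the CSV has a header row; inferSchema asks Spark to guess column types
df = spark.read.csv("/temp/proto_temp.csv", header=True, inferSchema=True)
df.write.parquet("output/proto.parquet")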

Solution 2

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood.

Here's the Koalas code:

import databricks.koalas as ks

df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
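
Koalas has since been folded into Spark itself as the pandas API on Spark (pyspark.pandas, available from Spark 3.2 onwards). On those versions the equivalent sketch, assuming the same file paths, is:

import pyspark.pandas as ps

df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')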
Author: ultraInstinct

Updated on July 09, 2022

Comments

  • ultraInstinct
    ultraInstinct almost 2 years

    I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

    from pyspark import SparkContext
    sc = SparkContext("local", "Protob Conversion to Parquet ")
    
    # spark is an existing SparkSession
    df = sc.textFile("/temp/proto_temp.csv")
    
    # Displays the content of the DataFrame to stdout
    df.write.parquet("/output/proto.parquet")
    

    Do you know how to make this work?

    The Spark version I'm using is 2.0.1, built for Hadoop 2.7.3.

  • eliasah
    eliasah about 7 years
    Even if your code is correct, your explanation isn't. SparkContext doesn't convert the CSV file to an RDD. The textFile method from SparkContext returns an RDD, and what you need is a DataFrame, hence a SQLContext or a HiveContext, which is also encapsulated in a SparkSession in Spark 2+. Would you mind correcting that information and accepting the answer to close the question?
  • ultraInstinct
    ultraInstinct about 7 years
    Thanks @eliasah for your feedback!
  • mnis.p
    mnis.p almost 6 years
    The answer is for a DataFrame. How can I write an RDD in Parquet format? (See the sketch after these comments.)
  • Sowmya
    Sowmya almost 4 years
    Hi @Powers, I tried installing it with sc.install_pypi_package("koalas") # Install latest koalas version while I was working on AWS EMR. However, when I tried importing it, it said No module named 'koalas'.
  • Powers
    Powers almost 4 years
    @Sowmya - This link explains how to install pypi packages in an EMR environment: docs.aws.amazon.com/emr/latest/ReleaseGuide/…. Hope that helps!
  • Sowmya
    Sowmya almost 4 years
    Thanks. That's indeed nice of you to reply to my comment. I knew that link; rather than doing a system installation, I was thinking more of a local or notebook-specific installation. Well, if the local one doesn't work out, I'll go for the system installation.
  • Haha
    Haha over 3 years
    df.write.parquet takes a folder as its argument, not the path to a single file.
  • nam
    nam over 2 years
    @eliasah Does your comment mean that for Spark 2+ we only need the following two lines to convert CSV to Parquet: df = spark.read.parquet("/path/to/infile.csv") and df.write.csv("/path/to/outfile.parquet")? Did I get it right?
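
To address the RDD question raised in the comments above: an RDD has no write attribute, so it has to be converted to a DataFrame first. This is not from the original answer, just a minimal sketch assuming the RDD holds comma-separated text lines; the column names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDD to Parquet").getOrCreate()

# hypothetical RDD of comma-separated text lines
rdd = spark.sparkContext.textFile("/temp/proto_temp.csv")
rows = rdd.map(lambda line: line.split(","))

# column names here are placeholders; adjust to match the real data
df = spark.createDataFrame(rows, ["col1", "col2", "col3"])
df.write.parquet("output/proto_from_rdd.parquet")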