How to copy and convert parquet files to csv
Solution 1
Try
df = spark.read.parquet("/path/to/infile.parquet")
df.write.csv("/path/to/outfile.csv")
Relevant API documentation: DataFrameReader.parquet and DataFrameWriter.csv.
Both /path/to/infile.parquet and /path/to/outfile.csv should be locations on the HDFS filesystem. You can specify the hdfs://... scheme explicitly, or omit it, since it is usually the default scheme.
You should avoid file://... paths, because a local path refers to a different file on every machine in the cluster. Write the output to HDFS instead, then transfer the results to your local disk from the command line:
hdfs dfs -get /path/to/outfile.csv /path/to/localfile.csv
Or display it directly from HDFS:
hdfs dfs -cat /path/to/outfile.csv
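If you want the output to arrive as a single file with a header row, something along these lines should work. Note that the SparkSession setup, the explicit hdfs:// scheme, and the coalesce(1)/header options are additions for illustration, not part of the answer above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-to-csv").getOrCreate()

# Read the parquet data from HDFS; the scheme may be given explicitly.
df = spark.read.parquet("hdfs:///path/to/infile.parquet")

# coalesce(1) collapses the output to a single part file,
# and header=True writes the column names as the first row.
df.coalesce(1).write.csv("hdfs:///path/to/outfile.csv", header=True, mode="overwrite")

The result is still a directory on HDFS containing a single part-*.csv file, which you can then fetch with hdfs dfs -get as shown above.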
Solution 2
If there is a table defined over those parquet files in Hive (or if you define such a table yourself), you can run a Hive query on that and save the results into a CSV file. Try something along the lines of:
insert overwrite local directory dirname row format delimited fields terminated by ',' select * from tablename;
Substitute dirname and tablename with actual values. Be aware that any existing content in the specified directory gets deleted. See Writing data into the filesystem from queries for details.
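If you already have a Spark session with Hive support enabled, the same statement can be submitted through spark.sql. This is only a sketch, assuming your Spark build supports the Hive INSERT OVERWRITE DIRECTORY syntax; the output directory and table name below are placeholders:

from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark see tables defined in the Hive metastore
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("""
    INSERT OVERWRITE LOCAL DIRECTORY '/tmp/tablename_csv'
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    SELECT * FROM tablename
""")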
Solution 3
If you don't know the exact name of your parquet file, a more dynamic snippet would be:
import glob

# glob matches paths on the local filesystem
for filename in glob.glob("[location_of_parquet_file]/*.snappy.parquet"):
    print(filename)
    df = sqlContext.read.parquet(filename)  # spark.read.parquet on Spark 2.x+
    # append mode so writing into the same destination does not fail after the first file
    df.write.csv("[destination]", mode="append")
    print("csv generated")
graffe
Updated on March 20, 2020

Comments
graffe, about 4 years ago:
I have access to an HDFS file system and can see parquet files with
hadoop fs -ls /user/foo
How can I copy those parquet files to my local system and convert them to csv so I can use them? The files should be simple text files with a number of fields per row.