How do I get schema / column names from parquet file?

hadoop apache-pig hdfs parquet

Solution 1

You won't be able to "open" the file using hdfs dfs -text because it's not a text file. Parquet files are written to disk very differently from text files.

For exactly this reason, the Parquet project provides parquet-tools, which lets you open a file and inspect its schema, data, metadata, etc.

Check out the parquet-tools project (which is, put simply, a jar file).

Cloudera, which supports and contributes heavily to Parquet, also has a nice page with examples of parquet-tools usage. An example from that page for your use case is:

parquet-tools schema part-m-00000.parquet

Check out the Cloudera page, "Using the Parquet File Format with Impala, Hive, Pig, HBase, and MapReduce".
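To illustrate why text tools such as hdfs dfs -text cannot read the file: a Parquet file is a binary container that starts and ends with the 4-byte marker PAR1, and the schema lives in a footer whose length is stored just before the trailing marker. The following is only a minimal sketch using the Python standard library, assuming a plain (unencrypted) file that has already been copied to local disk (for example with hdfs dfs -get). The function name footer_length is made up for illustration; it only checks the markers and does not decode the schema itself.

```python
import struct

MAGIC = b"PAR1"  # 4-byte marker at both the start and the end of a Parquet file


def footer_length(path):
    """Return the byte length of the footer metadata, or raise ValueError."""
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(0, 2)          # jump to the end of the file
        size = f.tell()
        if size < 12 or head != MAGIC:
            raise ValueError("not a Parquet file: bad header magic")
        f.seek(-8, 2)         # last 8 bytes: <footer length><magic>
        tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: missing footer magic")
    (length,) = struct.unpack("<I", tail[:4])  # little-endian unsigned 32-bit
    return length
```

If this raises ValueError, the file is probably truncated or not Parquet at all, and schema tools are likely to fail on it as well.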

Solution 2

If your Parquet files are located in HDFS or S3, like mine, you can try something like the following:

HDFS

parquet-tools schema hdfs://<YOUR_NAME_NODE_IP>:8020/<YOUR_FILE_PATH>/<YOUR_FILE>.parquet

S3

parquet-tools schema s3://<YOUR_BUCKET_PATH>/<YOUR_FILE>.parquet

Hope it helps.
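If you need to check several files, the HDFS and S3 commands above can be assembled from a script. This is a rough Python sketch, assuming the parquet-tools launcher is installed and on the PATH. The helper names hdfs_uri, s3_uri, schema_command, and print_schema are hypothetical, and the host, bucket, and paths you pass in are your own placeholders.

```python
import subprocess


def hdfs_uri(name_node, file_path, port=8020):
    """Build an hdfs:// URI; 8020 is the port used in the command above."""
    return f"hdfs://{name_node}:{port}/{file_path.lstrip('/')}"


def s3_uri(bucket_path, file_name):
    """Build an s3:// URI from a bucket path and a file name."""
    return f"s3://{bucket_path.strip('/')}/{file_name}"


def schema_command(uri):
    """Return the argument list for 'parquet-tools schema <uri>'."""
    return ["parquet-tools", "schema", uri]


def print_schema(uri):
    # Assumption: parquet-tools is installed and reachable on PATH.
    result = subprocess.run(
        schema_command(uri), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Passing the command as a list rather than a single string avoids shell quoting problems when a path contains spaces.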

Solution 3

If you use Docker you can also run parquet-tools in a container:

docker run -ti -v C:\file.parquet:/tmp/file.parquet nathanhowell/parquet-tools schema /tmp/file.parquet

Solution 4

parquet-cli is a lightweight alternative to parquet-tools.

pip install parquet-cli          # install via pip
parq filename.parquet            # view metadata
parq filename.parquet --schema   # view the schema
parq filename.parquet --head 10  # view the first n rows (here, 10)

This tool will provide basic info about the parquet file.

Solution 5

You can also use a desktop application to view Parquet and other binary data formats such as ORC and Avro. It's a pure Java application, so it runs on Linux, Mac, and Windows. See Bigdata File Viewer for details.

It supports complex data types such as array, map, etc.

Author by

Super_John

Analytics / Data Science professional in the ad-tech domain. User of Python, R, Linux, SQL. Also working on some web-development projects in Django. Plays around with Amazon AWS tools. Interested in deployment and dev-ops.

Updated on July 09, 2022

Comments

Super_John almost 2 years

I have a file stored in HDFS as part-m-00000.gz.parquet

I've tried to run hdfs dfs -text dir/part-m-00000.gz.parquet but it's compressed, so I ran gunzip part-m-00000.gz.parquet but it doesn't uncompress the file since it doesn't recognise the .parquet extension.

How do I get the schema / column names for this file?
Super_John over 8 years

Thank you. I wish - my current environment doesn't have hive, so I just have pig & hdfs for MR.
Super_John over 8 years

Thank you. Sounds like a lot more work than I expected!
Matteo Guarnerio over 8 years

Here is the updated repository for parquet-tools.
Avinav Mishra over 6 years

Unless you know the Parquet column structure, you will not be able to create a Hive table on top of it.
Itération 122442 about 3 years

None of the provided github links are working anymore :(
scravy almost 3 years

best way to run them
scravy almost 3 years

like them a lot better than the parquet-tools
Juha Syrjälä over 2 years

parquet-tools link is broken.
matmat over 2 years

parquet-tools threw an error about a missing footer, but parquet-cli worked for me.
Sandeep Singh about 2 years

Updated link for the tool: pypi.org/project/parquet-tools