How to show the schema (including types) of a parquet file from the command line or spark shell?


Solution 1

You should be able to do this:

sqlContext.read.parquet(path).printSchema()

From Spark docs:

// Print the schema in a tree format
df.printSchema()
// root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
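
For example, in spark-shell this is a one-liner end to end. The path and column names below are made up for illustration, and it assumes Spark 1.4+ where sqlContext.read is available:

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// Hypothetical parquet path
val df = sqlContext.read.parquet("/blar/blar.parquet")
// Prints each field with its type and nullability
df.printSchema()
// root
// |-- cust_id: string (nullable = true)
// |-- foo: double (nullable = true)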

Solution 2

OK, I think I have a workable way of doing it: just peek at the first row to infer the schema. (Though I'm not sure how elegant this is; what if the file happens to be empty? I'm sure there has to be a better solution.)

sqlContext.parquetFile(p).first()

At some point prints:

{
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
 fileSchema: message schema {
  optional binary cust_id;
  optional binary blar;
  optional double foo;
}
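
A side note on the empty-file worry: the schema is stored in the Parquet footer, so Spark can report it without reading any rows. A minimal sketch, assuming the Spark 1.4+ read API and a made-up path:

val df = sqlContext.read.parquet("/blar/blar")
// df.schema is a StructType, populated even when the data has zero rows
df.schema.foreach(f => println(s"${f.name}: ${f.dataType}"))
// cust_id: StringType
// foo: DoubleType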

Updated on June 17, 2022

Comments

  • samthebest (almost 2 years ago)

    I have worked out how to use the spark-shell to show the field names, but it's ugly and does not include the types:

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    
    println(sqlContext.parquetFile(path))
    

    prints:

    ParquetTableScan [cust_id#114,blar_field#115,blar_field2#116], (ParquetRelation /blar/blar), None