Unable to infer schema when loading Parquet file


Solution 1

This error usually occurs when you try to read an empty directory as Parquet. Your outcome DataFrame is probably empty.

You could check if the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
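For illustration, a minimal PySpark sketch of that guard (sqlc, the outcomes table and the mi_or_chd_5 column are borrowed from the question below and assumed to exist):

# Skip the write when the result is empty, so the output directory never
# ends up as an empty Parquet location that later fails schema inference.
outcome = sqlc.sql("SELECT eid, mi_or_chd_5 AS response "
                   "FROM outcomes WHERE mi_or_chd_5 IS NOT NULL")

if outcome.rdd.isEmpty():   # runs a small Spark job; fine for a one-off check
    print("outcome is empty, nothing to write")
else:
    outcome.write.parquet("mi_or_chd_5", mode="overwrite")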

Solution 2

In my case, the error occurred because I was trying to read a Parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
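As a rough check (the directory path here is hypothetical), you can list the files Spark will silently skip before you try to read them:

import os

# Hypothetical output directory: files whose names start with "_" or "."
# are treated by Spark as hidden/metadata and are skipped when reading.
path = "/data/parquet_output"
print([f for f in os.listdir(path) if f.startswith(("_", "."))])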


Solution 3

I'm using AWS Glue and I received this error while reading data from a Data Catalog table (location: S3 bucket). After a bit of analysis I realised that this was because the file was not available at the expected location (in my case, the S3 bucket path).

Glue was trying to apply the Data Catalog table schema to a file which didn't exist.

After copying the file into the S3 bucket location, the issue was resolved.
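If you want to verify that up front, a rough boto3 sketch like this (bucket name and prefix are placeholders) shows what actually sits under the table's S3 location:

import boto3

# Placeholder bucket/prefix: confirm the Data Catalog table's S3 location
# really contains objects before Glue applies the table schema to it.
s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="path/to/table/")

if resp.get("KeyCount", 0) == 0:
    print("No files under the table location - copy the data file(s) there first")
else:
    for obj in resp["Contents"]:
        print(obj["Key"], obj["Size"])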

Hope this helps someone who encounters this error in AWS Glue.

Solution 4

This occurs when you try to read a table that is empty. If data had been correctly inserted into the table, there would be no problem.

This isn't specific to Parquet; the same thing happens with ORC.

Solution 5

Just to emphasize @Davos's answer in a comment: you will encounter this exact exception if your file name starts with a dot . or an underscore _

val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
         .load("/Users/myuser/_HEADER_0")

org.apache.spark.sql.AnalysisException: 
Unable to infer schema for CSV. It must be specified manually.;

The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
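In PySpark, a minimal sketch of the same fix might look like this (the local path is taken from the example above; spark is assumed to be an existing SparkSession):

import os

# Rename the "_"-prefixed file so Spark no longer treats it as hidden/metadata,
# then read it again with the same options.
os.rename("/Users/myuser/_HEADER_0", "/Users/myuser/HEADER_0")

df = (spark.read.format("csv")
      .option("delimiter", "|")
      .option("header", "false")
      .load("/Users/myuser/HEADER_0"))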


Comments

  • user48956
    user48956 almost 2 years
    response = "mi_or_chd_5"
    
    outcome = sqlc.sql("""select eid,{response} as response
    from outcomes
    where {response} IS NOT NULL""".format(response=response))
    outcome.write.parquet(response, mode="overwrite") # Success
    print outcome.schema
    StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
    

    But then:

    outcome2 = sqlc.read.parquet(response)  # fail
    

    fails with:

    AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
    

    in

    /usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
    

    The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?

    Using Spark 2.1.1. Also fails in 2.2.0.

    Found this bug report, but it was fixed in 2.0.1 and 2.1.0.

    UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".

  • user48956
    user48956 over 6 years
    The dataframe is not empty. I believe the issue happens because the filename response can't be written to on the cluster. Works fine in local mode.
  • Javier Montón
    Javier Montón over 6 years
    Then maybe you could try changing the username. In PySpark: os.environ["HADOOP_USER_NAME"] = "hdfs", or in Scala: System.setProperty("HADOOP_USER_NAME", "hdfs")
  • user48956
    user48956 over 6 years
    I'm not sure we make use of Hadoop. Is it a requirement for Spark that needs to be configured with user profiles when the Spark cluster is installed? (All of our data is sourced from relational DBs and loaded into Spark on demand.) In any case, wouldn't I need to prefix the filename with "hdfs://"? If I use a filename such as "/my/nfs/network_directory/filename", saving works, which also makes me think that the path refers to the worker-local filesystem. (sorry -- Spark n00b)
  • Javier Montón
    Javier Montón over 6 years
    Sorry, I assumed you used Hadoop. You can run Spark in local[] mode, Standalone (a cluster with Spark only) or YARN (a cluster with Hadoop). If you're using YARN mode, by default all paths assume you're using HDFS and it's not necessary to put hdfs://; in fact, if you want to use local files you should use file://. If, for example, you are submitting an application to the cluster from your computer, the application will use your username and it probably won't have access to HDFS files. With HADOOP_USER_NAME you can change that. In Spark Standalone I don't know exactly how files and permissions work. Hope this helps!
  • user48956
    user48956 over 6 years
    I'll check, but as I remember, if I specified a valid path to a locally mounted network drive starting with "/mnt/...", all was fine. When '/...' is specified, does it look to Hadoop/YARN first and then the local filesystem if not found?
  • Kumar Vaibhav
    Kumar Vaibhav almost 6 years
    It's never a good practice to use the isEmpty() method. Please avoid it if you can - it 'can' bring the entire data into driver memory - refer to the RDD class code in Spark.
  • Sim
    Sim over 5 years
    Spark treats all files that begin with _ as metadata and not data.
  • Davos
    Davos over 4 years
    Also with AWS Glue, if the job bookmark filter results in there being no data and you attempt to write then it says "After final job bookmarks filter, processing 0.00% of 0 files in partition" which then leads to "Unable to infer schema for Parquet. It must be specified manually." because the frame being written is empty.
  • Davos
    Davos over 4 years
    "Spark 2.0 ignores the path names starting with underscore or dot; _ or . " as discussed by Spark developers here: issues.apache.org/jira/browse/…
  • user48956
    user48956 over 4 years
    Thanks. This was not my error. I think the error was the lack of a file system.
  • user48956
    user48956 over 3 years
    Yes. In retrospect, that may be obvious to someone who knows how to interpret Spark exception messages.