Unable to infer schema when loading Parquet file
Solution 1
This error usually occurs when you try to read an empty directory as Parquet. Your outcome DataFrame is probably empty.
You can check whether the DataFrame is empty with outcome.rdd.isEmpty() before writing it.
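As an alternative to checking the DataFrame itself, you can inspect the output directory before reading it back. The following is only a minimal sketch, assuming the directory is on a filesystem the driver can see (local disk or a mounted network drive, not HDFS or S3). The helper name has_data_files is hypothetical, and skipping every name that starts with `_` or `.` is a simplification of what Spark does when it lists input files.

```python
import os

def has_data_files(directory):
    """Return True if `directory` holds at least one non-empty file
    whose name does not start with '_' or '.'.

    Hypothetical helper: a simplified approximation of the files
    Spark would consider data, not Spark's actual listing logic.
    """
    for root, dirs, files in os.walk(directory):
        # Assumption: skip metadata/hidden subdirectories such as _temporary.
        dirs[:] = [d for d in dirs if not d.startswith(("_", "."))]
        for name in files:
            if name.startswith(("_", ".")):
                continue  # e.g. _SUCCESS markers or .crc checksum files
            if os.path.getsize(os.path.join(root, name)) > 0:
                return True
    # Also reached when the directory does not exist (os.walk yields nothing).
    return False
```

If this returns False for the path you are about to pass to read.parquet, the schema inference error is expected, because there is nothing to infer a schema from.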
Solution 2
In my case, the error occurred because I was trying to read a Parquet file whose name started with an underscore (e.g. _lots_of_data.parquet). Not sure why this was an issue, but removing the leading underscore solved the problem.
Solution 3
I'm using AWS Glue and received this error while reading data from a Data Catalog table (location: an S3 bucket). After a bit of analysis I realised that the file was not available at the table's location (in my case, the S3 bucket path).
Glue was trying to apply the Data Catalog table schema to a file that doesn't exist.
After copying the file into the S3 bucket location, the issue was resolved.
Hope this helps someone who encounters this error in AWS Glue.
Solution 4
This happens when you try to read a table that is empty. If data had been inserted into the table correctly, there would be no problem.
The same thing happens with ORC, not only with Parquet.
Solution 5
Just to emphasize @Davos's answer in a comment: you will encounter this exception if your file name has a dot (.) or an underscore (_) at the start:
val df = spark.read.format("csv").option("delimiter", "|").option("header", "false")
.load("/Users/myuser/_HEADER_0")
org.apache.spark.sql.AnalysisException:
Unable to infer schema for CSV. It must be specified manually.;
The solution is to rename the file and try again (e.g. rename _HEADER to HEADER).
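The rename can be scripted. This is a rough sketch, assuming the file lives on a local or mounted filesystem; strip_hidden_prefix is a hypothetical helper name, not part of Spark.

```python
import os

def strip_hidden_prefix(path):
    """Rename a file so its name no longer starts with '_' or '.'.

    Hypothetical helper. Returns the new path, or the original path
    when the name needed no change.
    """
    directory, name = os.path.split(path)
    cleaned = name.lstrip("_.")  # drop every leading underscore or dot
    if not cleaned:
        raise ValueError("nothing left after stripping prefix: %r" % name)
    if cleaned == name:
        return path  # already a name Spark will not skip
    target = os.path.join(directory, cleaned)
    if os.path.exists(target):
        # Refuse to overwrite an existing file with the cleaned name.
        raise FileExistsError(target)
    os.rename(path, target)
    return target
```

For the example above, strip_hidden_prefix("/Users/myuser/_HEADER_0") would rename the file to /Users/myuser/HEADER_0, which can then be passed to load().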
user48956
Updated on July 09, 2022
Comments
-
user48956 almost 2 years
response = "mi_or_chd_5"
outcome = sqlc.sql("""select eid,{response} as response from outcomes where {response} IS NOT NULL""".format(response=response))
outcome.write.parquet(response, mode="overwrite") # Success
print outcome.schema
StructType(List(StructField(eid,IntegerType,true),StructField(response,ShortType,true)))
But then:
outcome2 = sqlc.read.parquet(response) # fail
fails with:
AnalysisException: u'Unable to infer schema for Parquet. It must be specified manually.;'
in
/usr/local/lib/python2.7/dist-packages/pyspark-2.1.0+hadoop2.7-py2.7.egg/pyspark/sql/utils.pyc in deco(*a, **kw)
The documentation for Parquet says the format is self-describing, and the full schema was available when the Parquet file was saved. What gives?
Using Spark 2.1.1. Also fails in 2.2.0.
Found this bug report, but it was fixed in 2.0.1 and 2.1.0.
UPDATE: This works when connected with master="local", and fails when connected to master="mysparkcluster".
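Because the same relative path behaves differently in local and cluster mode, one way to rule out ambiguity is to hand Spark a fully qualified URI instead of a bare name. A minimal sketch, assuming the target is a local or mounted filesystem (the file:// scheme) and that the same mount is visible to the driver and to every worker; to_file_uri is a hypothetical helper name.

```python
from pathlib import Path

def to_file_uri(path):
    """Turn a relative or absolute filesystem path into an absolute file:// URI.

    Hypothetical helper: it only makes the scheme explicit. It does not
    make the location shared between the driver and the workers.
    """
    return Path(path).resolve().as_uri()
```

For instance, outcome.write.parquet(to_file_uri("/mnt/shared/mi_or_chd_5"), mode="overwrite") would write to an explicit file:// location, where /mnt/shared stands in for whatever shared mount your cluster actually has.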
-
user48956 over 6 years
The DataFrame is not empty. I believe the issue happens because the filename response can't be written to on the cluster. Works fine in local mode.
-
Javier Montón over 6 years
Then maybe you could try changing the user name. In PySpark:
os.environ["HADOOP_USER_NAME"] = "hdfs"
or in Scala:
System.setProperty("HADOOP_USER_NAME", "hdfs")
-
user48956 over 6 years
I'm not sure we make use of Hadoop. Is it a requirement for Spark, and does it need to be configured with user profiles when the Spark cluster is installed? (All of our data is sourced from relational DBs and loaded into Spark on demand.) In any case, wouldn't I need to prefix the filename with "hdfs://"? If I use a filename such as "/my/nfs/network_directory/filename", saving works, which also makes me think that the path refers to the worker-local filesystem. (Sorry, Spark n00b.)
-
Javier Montón over 6 years
Sorry, I assumed you used Hadoop. You can run Spark in Local[], Standalone (cluster with Spark only) or YARN (cluster with Hadoop) mode. If you're using YARN mode, by default all paths are assumed to be on HDFS and it's not necessary to put hdfs:// in front; in fact, if you want to use local files you should use file://.
If, for example, you submit an application to the cluster from your computer, the application will use your user name, which probably doesn't have access to the HDFS files. With HADOOP_USER_NAME you can change it. In Spark Standalone I don't know exactly how files and permissions work. Hope this helps!
-
user48956 over 6 years
I'll check, but as I remember, if I specified a valid path to a locally mounted network drive starting with "/mnt/...", all was fine. When '/...' is specified, does it look to Hadoop/YARN first and then to the local file system if not found?
-
Kumar Vaibhav almost 6 years
It's never good practice to use the isEmpty() method. Please avoid it if you can: it 'can' bring the entire data into driver memory. Refer to the RDD class code in Spark.
-
Sim over 5 years
Spark treats all files that begin with _ as metadata and not data.
-
Davos over 4 years
Also with AWS Glue, if the job bookmark filter results in there being no data and you attempt to write, it says "After final job bookmarks filter, processing 0.00% of 0 files in partition", which then leads to "Unable to infer schema for Parquet. It must be specified manually." because the frame being written is empty.
-
Davos over 4 years
"Spark 2.0 ignores the path names starting with underscore or dot; _ or ." as discussed by Spark developers here: issues.apache.org/jira/browse/…
-
Davos over 4 years
"Spark 2.0 ignores the path (file) names starting with underscore or dot; _ or ." as discussed by Spark developers here: issues.apache.org/jira/browse/…
-
user48956 over 4 years
Thanks. This was not my error. I think the error was the lack of a file system.
-
user48956 over 3 years
Yes. In retrospect, that may be obvious to someone who knows how to interpret Spark exception messages.