Spark Catalog w/ AWS Glue: database not found
Solution 1
Adding the hive.metastore.client.factory.class configuration to the code that initiates the Spark session solved the issue for me:
SparkSession spark = SparkSession.builder()
    ...
    .config("hive.metastore.client.factory.class", "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()
    .getOrCreate();
That's the same configuration defined in the AWS docs (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-glue.html) and added to the cluster configuration when you check "Use for Hive table metadata" on cluster creation, but for some reason it doesn't work as expected (I'm using EMR 5.12.0).
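If you would rather not change the application code, the same property can usually be supplied at submit time instead, since Spark forwards any spark.hadoop.* setting into the Hadoop/Hive configuration. A minimal sketch, not from the original answer (your-job.jar is a placeholder for the actual application jar):

spark-submit \
    --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
    your-job.jar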
Solution 2
I had the same issue: spark-submit will not discover the AWS Glue libraries, but spark-shell running on the master node will.

It turns out that my spark-submit job uses a fat .jar which was compiled with the standard org.apache.spark and org.apache.hive libraries. The jar's bundled libraries were being used instead of the custom classes installed on EMR.
If this is the case for you, make sure to exclude all org.apache.spark:, org.apache.hive:, and org.apache.hadoop: modules from your .jar. Here is the reference I used for Gradle: http://unethicalblogger.com/2015/07/15/gradle-goodness-excluding-depends-from-shadow.html.
Adding the compileOnly keyword in front of all Spark libraries fixed it, as sketched below.
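For illustration, a minimal build.gradle sketch of this approach, assuming the Shadow plugin builds the fat jar; the plugin and dependency versions are illustrative, not taken from the original answer:

// build.gradle -- a minimal sketch, assuming the Shadow plugin;
// versions are illustrative, adjust to your EMR release
plugins {
    id 'java'
    id 'com.github.johnrengelman.shadow' version '7.1.2'
}

dependencies {
    // compileOnly keeps these dependencies off the runtime classpath,
    // so shadowJar omits them from the fat jar and the Glue-aware
    // versions installed on EMR are picked up at runtime instead
    compileOnly 'org.apache.spark:spark-sql_2.11:2.2.1'
    compileOnly 'org.apache.hive:hive-exec:1.2.1'
    compileOnly 'org.apache.hadoop:hadoop-client:2.8.3'
}

This works because shadowJar bundles only the runtime classpath by default, so compileOnly dependencies stay out of the fat jar.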
Solution 3
Our issue was IAM permissions on the EMR cluster; make sure that the cluster's IAM instance profile has full access to Glue (one way to grant this is shown below).
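For example, attaching AWS's managed Glue policy to the instance profile role with the AWS CLI; both the role name (the EMR default) and the choice of managed policy here are assumptions, so adjust them to your setup:

aws iam attach-role-policy \
    --role-name EMR_EC2_DefaultRole \
    --policy-arn arn:aws:iam::aws:policy/AWSGlueConsoleFullAccess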
autodidacticon
Updated on June 04, 2022

Comments
- autodidacticon, almost 2 years ago
I've created an EMR cluster with the Glue Data Catalog. When I invoke the spark-shell, I am able to successfully list tables stored within a Glue database via:

spark.catalog.setCurrentDatabase("test")
spark.catalog.listTables

However, when I submit a job via spark-submit, I get a fatal error:

ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Database 'test' does not exist.;

I am creating my SparkSession within the submitted job via SparkSession.builder.enableHiveSupport.getOrCreate.