pyspark.sql.utils.AnalysisException: u'Path does not exist


The function parquet_example in the question creates a DataFrame from the Parquet file test.parquet, registers it as a temporary view, and queries that view.

From the comments:
Since a Hive table named test already exists, query the table directly with a Hive-enabled SparkSession instead of reading the Parquet file:

warehouseLocation = "hdfs://user/hive/warehouse/"
spark = SparkSession \
    .builder \
    .appName("example") \
    .config("spark.sql.warehouse.dir", warehouseLocation) \
    .enableHiveSupport() \
    .getOrCreate()
spark.sql("SELECT * FROM test").show()
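For context on why the original read failed: spark.read.parquet("test.parquet") passes a relative path, and Hadoop resolves relative paths against the running user's HDFS home directory (/user/hadoop here), which is exactly the path in the error message. A minimal sketch of that resolution rule (the helper name and defaults are illustrative, not a real Hadoop API):

```python
import posixpath

def resolve_hdfs_path(path, fs_default="hdfs://namenode:8020", user="hadoop"):
    """Approximate Hadoop's path resolution: a relative path is taken
    against the user's HDFS home directory, /user/<username>."""
    if "://" in path:            # already fully qualified, use as-is
        return path
    if path.startswith("/"):     # absolute path within the default FS
        return fs_default + path
    return fs_default + posixpath.join("/user", user, path)

print(resolve_hdfs_path("test.parquet"))
# hdfs://namenode:8020/user/hadoop/test.parquet
```

So the alternative to querying the Hive table is to pass an absolute or fully qualified path, e.g. spark.read.parquet("hdfs:///user/hive/warehouse/test"), so the read does not depend on the user's home directory.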
Author: BubbleTree (updated on June 04, 2022)

Comments

  • BubbleTree (over 1 year ago):

    I am running a Spark job on Amazon EMR using standard HDFS, not S3, to store my files. I have a Hive table in hdfs://user/hive/warehouse/, but it cannot be found when my Spark job runs. I configured the Spark property spark.sql.warehouse.dir to point at my HDFS directory, and while the YARN logs do say:

    17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.
    

    later in the logs it says (full log at the end of this post):

    LogType:stdout
    Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
    LogLength:854
    Log Contents:
    Traceback (most recent call last):
      File "test.py", line 25, in <module>
        parquet_example(spark)
      File "test.py", line 9, in parquet_example
        tests = spark.read.parquet("test.parquet")
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
    End of LogType:stdout
    

    What am I doing wrong to cause this mismatch in the path?

    Here is my HDFS directory listing for hive/warehouse:

    hdfs dfs -ls /user/hive/warehouse
    Found 1 items
    drwxrwxrwt   - hadoop hadoop          0 2017-03-28 18:50 /user/hive/warehouse/test
    

    Here is what /user/hadoop/ gives me:

    hdfs dfs -ls /user/hadoop/
    Found 2 items
    drwxr-xr-x   - hadoop hadoop          0 2017-03-28 16:53 /user/hadoop/.hiveJars
    drwxr-xr-x   - hadoop hadoop          0 2017-03-28 19:54 /user/hadoop/.sparkStaging
    

    And here is my spark job in python:

    from __future__ import print_function
    from pyspark.sql import SparkSession
    from pyspark.sql import Row
    
    def parquet_example(spark):
        tests = spark.read.parquet("test.parquet")
        tests.createOrReplaceTempView("tests")
        tests_result = spark.sql("SELECT * FROM test")
        tests_result.show()
    
    if __name__ == "__main__":
        warehouseLocation = "hdfs://user/hive/warehouse/"
        spark = SparkSession.builder.appName("example").config("spark.sql.warehouse.dir", warehouseLocation).enableHiveSupport().getOrCreate()
    
        parquet_example(spark)
        spark.stop()
    

    Full YARN log:

    Container: container_1490717578939_0012_01_000001 on ip-xxx-xx-xx-xxx.ec2.internal_8041
    =========================================================================================
    LogType:stderr
    Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
    LogLength:14054
    Log Contents:
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/mnt/yarn/usercache/hadoop/filecache/131/__spark_libs__713193244228500015.zip/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
    17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for TERM
    17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for HUP
    17/03/28 19:54:01 INFO SignalUtils: Registered signal handler for INT
    17/03/28 19:54:02 INFO ApplicationMaster: Preparing Local resources
    17/03/28 19:54:03 INFO ApplicationMaster: ApplicationAttemptId: appattempt_1490717578939_0012_000001
    17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
    17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
    17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to: 
    17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to: 
    17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
    17/03/28 19:54:03 INFO ApplicationMaster: Starting the user application in a separate Thread
    17/03/28 19:54:03 INFO ApplicationMaster: Waiting for spark context initialization...
    17/03/28 19:54:03 INFO SparkContext: Running Spark version 2.1.0
    17/03/28 19:54:03 INFO SecurityManager: Changing view acls to: yarn,hadoop
    17/03/28 19:54:03 INFO SecurityManager: Changing modify acls to: yarn,hadoop
    17/03/28 19:54:03 INFO SecurityManager: Changing view acls groups to: 
    17/03/28 19:54:03 INFO SecurityManager: Changing modify acls groups to: 
    17/03/28 19:54:03 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(yarn, hadoop); groups with view permissions: Set(); users  with modify permissions: Set(yarn, hadoop); groups with modify permissions: Set()
    17/03/28 19:54:03 INFO Utils: Successfully started service 'sparkDriver' on port 33579.
    17/03/28 19:54:04 INFO SparkEnv: Registering MapOutputTracker
    17/03/28 19:54:04 INFO SparkEnv: Registering BlockManagerMaster
    17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
    17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
    17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-f3713d64-91da-4cb5-9b55-d4a18c607a74
    17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-634c7d4b-026c-4df7-abf4-7846bd7fc958
    17/03/28 19:54:04 INFO DiskBlockManager: Created local directory at /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/blockmgr-19f0a265-755a-42f0-9282-1e3d98a57ab1
    17/03/28 19:54:04 INFO MemoryStore: MemoryStore started with capacity 414.4 MB
    17/03/28 19:54:04 INFO SparkEnv: Registering OutputCommitCoordinator
    17/03/28 19:54:04 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
    17/03/28 19:54:04 INFO Utils: Successfully started service 'SparkUI' on port 37056.
    17/03/28 19:54:04 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://xxx.xx.xx.xxx:37056
    17/03/28 19:54:04 INFO YarnClusterScheduler: Created YarnClusterScheduler
    17/03/28 19:54:04 INFO SchedulerExtensionServices: Starting Yarn extension services with app application_1490717578939_0012 and attemptId Some(appattempt_1490717578939_0012_000001)
    17/03/28 19:54:04 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
    17/03/28 19:54:04 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 34414.
    17/03/28 19:54:04 INFO NettyBlockTransferService: Server created on xxx.xx.xx.xxx:34414
    17/03/28 19:54:04 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
    17/03/28 19:54:04 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
    17/03/28 19:54:04 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:34414 with 414.4 MB RAM, BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
    17/03/28 19:54:04 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
    17/03/28 19:54:04 INFO BlockManager: external shuffle service port = 7337
    17/03/28 19:54:04 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, xxx.xx.xx.xxx, 34414, None)
    17/03/28 19:54:05 INFO EventLoggingListener: Logging events to hdfs:///var/log/spark/apps/application_1490717578939_0012_1
    17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
    17/03/28 19:54:05 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered!
    17/03/28 19:54:05 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
    17/03/28 19:54:05 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
    17/03/28 19:54:05 INFO YarnSchedulerBackend$YarnSchedulerEndpoint: ApplicationMaster registered as NettyRpcEndpointRef(spark://[email protected]:33579)
    17/03/28 19:54:05 INFO ApplicationMaster: 
    ===============================================================================
    YARN executor launch context:
      env:
        CLASSPATH -> /usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*<CPS>{{PWD}}<CPS>{{PWD}}/__spark_conf__<CPS>{{PWD}}/__spark_libs__/*<CPS>$HADOOP_CONF_DIR<CPS>$HADOOP_COMMON_HOME/*<CPS>$HADOOP_COMMON_HOME/lib/*<CPS>$HADOOP_HDFS_HOME/*<CPS>$HADOOP_HDFS_HOME/lib/*<CPS>$HADOOP_MAPRED_HOME/*<CPS>$HADOOP_MAPRED_HOME/lib/*<CPS>$HADOOP_YARN_HOME/*<CPS>$HADOOP_YARN_HOME/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-api-jdo.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-core.jar<CPS>/usr/lib/spark/yarn/lib/datanucleus-rdbms.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*<CPS>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*<CPS>/usr/lib/hadoop-lzo/lib/*<CPS>/usr/share/aws/emr/emrfs/conf<CPS>/usr/share/aws/emr/emrfs/lib/*<CPS>/usr/share/aws/emr/emrfs/auxlib/*<CPS>/usr/share/aws/emr/lib/*<CPS>/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar<CPS>/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar<CPS>/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar<CPS>/usr/share/aws/emr/cloudwatch-sink/lib/*
        SPARK_YARN_STAGING_DIR -> hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/.sparkStaging/application_1490717578939_0012
        SPARK_USER -> hadoop
        SPARK_YARN_MODE -> true
        PYTHONPATH -> {{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
    
      command:
        LD_LIBRARY_PATH="/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH" \ 
          {{JAVA_HOME}}/bin/java \ 
          -server \ 
          -Xmx5120m \ 
          '-verbose:gc' \ 
          '-XX:+PrintGCDetails' \ 
          '-XX:+PrintGCDateStamps' \ 
          '-XX:+UseConcMarkSweepGC' \ 
          '-XX:CMSInitiatingOccupancyFraction=70' \ 
          '-XX:MaxHeapFreeRatio=70' \ 
          '-XX:+CMSClassUnloadingEnabled' \ 
          '-XX:OnOutOfMemoryError=kill -9 %p' \ 
          -Djava.io.tmpdir={{PWD}}/tmp \ 
          '-Dspark.history.ui.port=18080' \ 
          -Dspark.yarn.app.container.log.dir=<LOG_DIR> \ 
          org.apache.spark.executor.CoarseGrainedExecutorBackend \ 
          --driver-url \ 
          spark://[email protected]:33579 \ 
          --executor-id \ 
          <executorId> \ 
          --hostname \ 
          <hostname> \ 
          --cores \ 
          2 \ 
          --app-id \ 
          application_1490717578939_0012 \ 
          --user-class-path \ 
          file:$PWD/__app__.jar \ 
          1><LOG_DIR>/stdout \ 
          2><LOG_DIR>/stderr
    
      resources:
        py4j-0.10.4-src.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/py4j-0.10.4-src.zip" } size: 74096 timestamp: 1490730839170 type: FILE visibility: PRIVATE
        __spark_conf__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_conf__.zip" } size: 75741 timestamp: 1490730839402 type: ARCHIVE visibility: PRIVATE
        pyspark.zip -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/pyspark.zip" } size: 452353 timestamp: 1490730838849 type: FILE visibility: PRIVATE
        __spark_libs__ -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/__spark_libs__713193244228500015.zip" } size: 196686961 timestamp: 1490730836856 type: ARCHIVE visibility: PRIVATE
        hive-site.xml -> resource { scheme: "hdfs" host: "ip-xxx-xx-xx-xxx.ec2.internal" port: 8020 file: "/user/hadoop/.sparkStaging/application_1490717578939_0012/hive-site.xml" } size: 2375 timestamp: 1490730837023 type: FILE visibility: PRIVATE
    
    ===============================================================================
    17/03/28 19:54:05 INFO RMProxy: Connecting to ResourceManager at ip-xxx-xx-xx-xxx.ec2.internal/xxx-xx-xx-xxx:8030
    17/03/28 19:54:05 INFO YarnRMClient: Registering the ApplicationMaster
    17/03/28 19:54:05 INFO SharedState: Warehouse path is 'hdfs://user/hive/warehouse/'.
    17/03/28 19:54:05 INFO Utils: Using initial executors = 0, max of spark.dynamicAllocation.initialExecutors, spark.dynamicAllocation.minExecutors and spark.executor.instances
    17/03/28 19:54:05 INFO ApplicationMaster: Started progress reporter thread with (heartbeat : 3000, initial allocation : 200) intervals
    17/03/28 19:54:05 INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
    17/03/28 19:54:06 INFO metastore: Trying to connect to metastore with URI thrift://ip-xxx-xx-xx-xxx.ec2.internal:9083
    17/03/28 19:54:06 INFO metastore: Connected to metastore.
    17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn
    17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/5f653144-e990-45b0-ba73-cdb4d10e9f7a_resources
    17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a
    17/03/28 19:54:06 INFO SessionState: Created local directory: /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/tmp/yarn/5f653144-e990-45b0-ba73-cdb4d10e9f7a
    17/03/28 19:54:06 INFO SessionState: Created HDFS directory: /tmp/hive/hadoop/5f653144-e990-45b0-ba73-cdb4d10e9f7a/_tmp_space.db
    17/03/28 19:54:06 INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.1) is hdfs://user/hive/warehouse/
    17/03/28 19:54:06 ERROR ApplicationMaster: User application exited with status 1
    17/03/28 19:54:06 INFO ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
    17/03/28 19:54:06 INFO SparkContext: Invoking stop() from shutdown hook
    17/03/28 19:54:06 INFO SparkUI: Stopped Spark web UI at http://xxx.xx.xx.xxx:37056
    17/03/28 19:54:06 INFO YarnClusterSchedulerBackend: Shutting down all executors
    17/03/28 19:54:06 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
    17/03/28 19:54:06 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
    (serviceOption=None,
     services=List(),
     started=false)
    17/03/28 19:54:06 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/03/28 19:54:06 INFO MemoryStore: MemoryStore cleared
    17/03/28 19:54:06 INFO BlockManager: BlockManager stopped
    17/03/28 19:54:06 INFO BlockManagerMaster: BlockManagerMaster stopped
    17/03/28 19:54:06 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/03/28 19:54:06 INFO SparkContext: Successfully stopped SparkContext
    17/03/28 19:54:06 INFO ShutdownHookManager: Shutdown hook called
    17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt1/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-3a6db594-2b44-47fe-8e48-4220b93e789a
    17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt2/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-a54516f0-48be-4fdb-899b-bbee998468b1
    17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072
    17/03/28 19:54:06 INFO ShutdownHookManager: Deleting directory /mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/spark-552e3cae-c119-47a5-9c63-34d4df59d072/pyspark-a0240093-16c6-43e4-8f2c-dcef309afe97
    End of LogType:stderr
    
    LogType:stdout
    Log Upload Time:Tue Mar 28 19:54:15 +0000 2017
    LogLength:854
    Log Contents:
    Traceback (most recent call last):
      File "test.py", line 25, in <module>
        parquet_example(spark)
      File "test.py", line 9, in parquet_example
        tests = spark.read.parquet("test.parquet")
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/readwriter.py", line 274, in parquet
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
      File "/mnt/yarn/usercache/hadoop/appcache/application_1490717578939_0012/container_1490717578939_0012_01_000001/pyspark.zip/pyspark/sql/utils.py", line 69, in deco
    pyspark.sql.utils.AnalysisException: u'Path does not exist: hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020/user/hadoop/test.parquet;'
    End of LogType:stdout