Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)


Solution 1

I had this exact problem when multiple users were trying to run on our cluster at once. The fix was to change a setting of the YARN scheduler.

In the file /etc/hadoop/conf/capacity-scheduler.xml we changed the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5.

Changing this setting increases the fraction of cluster resources that may be allocated to application masters, which raises the number of application masters that can run at once and hence the number of possible concurrent applications.
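For reference, the edit looks like this inside capacity-scheduler.xml (a sketch; only the file path, property name, and values come from above):

<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.5</value>
</property>

After saving the file, the ResourceManager can reload the queue configuration without a restart via the standard admin command:

yarn rmadmin -refreshQueues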

Solution 2

I got this error in this situation:

  1. MASTER=yarn (or yarn-client)
  2. spark-submit runs on a computer outside the cluster, and there is no route from the cluster back to it because it is hidden behind a router

Logs for container_1453825604297_0001_02_000001 (from ResourceManager web UI):

16/01/26 08:30:38 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
16/01/26 08:31:41 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:44 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:45 ERROR yarn.ApplicationMaster: Uncaught exception: 
org.apache.spark.SparkException: Failed to connect to driver!
    at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484) 

I worked around it by using YARN cluster mode: MASTER=yarn-cluster.
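In cluster mode the driver runs inside the cluster, so the ApplicationMaster never needs a route back to the submitting machine. A minimal sketch of the workaround (the class and jar names are placeholders, not from the original setup):

# Driver runs on a cluster node; no inbound route to the client is required
./bin/spark-submit \
    --master yarn-cluster \
    --class com.example.MyApp \
    /path/to/my-app.jar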

On another computer that is configured in a similar way, but whose IP is reachable from the cluster, both yarn-client and yarn-cluster work.

Others may encounter this error for different reasons; my point is that checking the error logs (not visible in the terminal, but in the ResourceManager web UI in this case) almost always helps.

Solution 3

This suggests that YARN cannot assign resources to the new application you are submitting. Try requesting fewer resources for the container, or try this on a less busy cluster.
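For example, a submission with trimmed-down resource requests might look like this (a sketch that reuses the flags from the question's own command, just with smaller values):

./spark/bin/spark-submit \
    --class org.apache.spark.examples.streaming.myclassname \
    --master yarn-cluster \
    --num-executors 1 \
    --driver-memory 512m \
    --executor-memory 512m \
    --executor-cores 1 \
    /home/hadoop/test.jar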

Another thing to try is to check whether YARN is working properly as a service:

sudo service hadoop-yarn-nodemanager status
sudo service hadoop-yarn-resourcemanager status
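If both services are up, it is also worth confirming that the NodeManagers have actually registered with the ResourceManager and that nothing else is occupying the queue (standard YARN CLI commands; the output columns vary by version):

# List registered NodeManagers (they should be in state RUNNING)
yarn node -list

# List applications currently submitted or running in the queue
yarn application -list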

Solution 4

I had a small cluster where resources were limited (~3 GB per node). I solved this problem by lowering the minimum memory allocation to a sufficiently small value.

From:

yarn.scheduler.minimum-allocation-mb: 1g
yarn.scheduler.increment-allocation-mb: 512m

To:

yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
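Note that on a plain Hadoop install, yarn.scheduler.minimum-allocation-mb lives in yarn-site.xml and takes an integer number of megabytes (the "256m" notation above is how Cloudera Manager renders it, and yarn.scheduler.increment-allocation-mb is a Fair Scheduler setting). A sketch of the equivalent yarn-site.xml entry, assuming that layout:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>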

Solution 5

I am on a slightly different setup, using CDH 5.4. I think the cause of this issue on my setup is that something gets stuck because of an error (file already exists, etc.): it happens after some other part of my code errors out and I try to fix it and kick it off again.

I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it's probably due to resources still being allocated to something that errored out; you need to reclaim those resources to be able to run again, or allocate them differently to begin with.

For example, my cluster has 4 executors available to it. In SparkConf for one process, I set spark.executor.instances to 4. While that process is still running, potentially hung up for some reason, I kick off another job (either the same way, or with spark-submit) with spark.executor.instances set to 1 ("--num-executors 1" if using spark-submit). I only have 4, and 4 are allocated to the earlier process, so this one, which is asking for 1 executor, has to wait in line.
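Rather than restarting every service, the hung application can usually be killed directly so its executors are returned to the pool (standard YARN CLI; fill in the application ID from the list output or the ResourceManager web UI):

# Find the application that is still holding the executors
yarn application -list

# Kill it to release its resources back to the queue
yarn application -kill <application-ID>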

Comments

  • Sam almost 2 years

    I am running the Kinesis plus Spark application from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html

    I am running it with the command below on an EC2 instance:

     ./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1  /home/hadoop/test.jar 
    

    I have installed Spark on EMR.

    EMR details:
    Master instance group - 1 x m1.medium (RUNNING)
    Core instance group   - 2 x m1.medium (RUNNING)


    I am getting the INFO output below, and it never ends.

    15/06/14 11:33:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
    15/06/14 11:33:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
    15/06/14 11:33:23 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
    15/06/14 11:33:23 INFO yarn.Client: Setting up container launch context for our AM
    15/06/14 11:33:23 INFO yarn.Client: Preparing resources for our AM container
    15/06/14 11:33:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.1.e/lib/spark-assembly-1.3.1-hadoop2.4.0.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/spark-assembly-1.3.1-hadoop2.4.0.jar
    15/06/14 11:33:29 INFO yarn.Client: Uploading resource file:/home/hadoop/test.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/test.jar
    15/06/14 11:33:31 INFO yarn.Client: Setting up the launch environment for our AM container
    15/06/14 11:33:31 INFO spark.SecurityManager: Changing view acls to: hadoop
    15/06/14 11:33:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
    15/06/14 11:33:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
    15/06/14 11:33:31 INFO yarn.Client: Submitting application 23 to ResourceManager
    15/06/14 11:33:31 INFO impl.YarnClientImpl: Submitted application application_1434263747091_0023
    15/06/14 11:33:32 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:32 INFO yarn.Client:
             client token: N/A
             diagnostics: N/A
             ApplicationMaster host: N/A
             ApplicationMaster RPC port: -1
             queue: default
             start time: 1434281611893
             final status: UNDEFINED
             tracking URL: http://172.31.13.68:9046/proxy/application_1434263747091_0023/
             user: hadoop
    15/06/14 11:33:33 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:34 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:35 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:36 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:37 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:38 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:39 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:40 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    15/06/14 11:33:41 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
    

    Could somebody please let me know why it's not working?

  • marios almost 9 years
    This issue seems to persist even if the application has an end (a non-streaming batch job).
  • Michael over 8 years
    Just want to add that this is a very important parameter to set if you are running the cluster on a single machine with limited resources.
  • Abhi almost 8 years
    I had to start up my yarn-nodemanager. Thanks!
  • Hendrik F over 7 years
    If you are running YARN, you can use yarn application -kill <application-ID>. You can check the status of your jobs on port 8088; if running locally, point your browser to localhost:8088. When I develop something locally, I sometimes have Hive sessions or Zeppelin jobs running as well, and I need to kill those first before the Spark job executes.
  • tokland over 7 years
    --conf spark.yarn.executor.memoryOverhead. No value?
  • Keith almost 7 years
    Same issue as in your case; it should be a route table issue.
  • frb almost 7 years
    Thanks for this explanation!
  • Admin over 6 years
    I am working in HUE and I have the same problem. Where do I have to change the above configuration?
  • Admin over 6 years
    I am working in HUE and I have the same problem. Where can I find the capacity-scheduler.xml file?
  • Sangram Gaikwad over 5 years
    Worked for me. It's possible to run multiple Spark applications now.
  • user1 almost 4 years
    @user5227388 sudo find / -name capacity-scheduler.xml