Application report for application_ (state: ACCEPTED) never ends for Spark Submit (with Spark 1.2.0 on YARN)
Solution 1
I had this exact problem when multiple users were trying to run jobs on our cluster at once. The fix was to change a scheduler setting.
In the file /etc/hadoop/conf/capacity-scheduler.xml we changed the property yarn.scheduler.capacity.maximum-am-resource-percent from 0.1 to 0.5.
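In the file itself, the change looks roughly like this (the property name and values are the ones above; the surrounding file contains many other properties not shown):

```xml
<property>
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <!-- Fraction of cluster resources that may be used to run application
       masters; raising it from the 0.1 default lets more AMs, and thus
       more concurrent applications, run at once -->
  <value>0.5</value>
</property>
```

Restart the ResourceManager (or refresh the queues) for the change to take effect.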
Changing this setting increases the fraction of cluster resources made available to application masters, which increases the number of application masters that can run at once and hence the number of possible concurrent applications.
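As a back-of-the-envelope sketch of why this matters (the cluster size below is a made-up number, not from the question; the 1408 MB AM container matches the 1 GB driver + 384 MB overhead seen in the log later in this thread):

```shell
# Hypothetical cluster with 8192 MB of total memory; each application
# master needs a 1408 MB container (1 GB driver + 384 MB overhead).
cluster_mb=8192
am_mb=1408
for pct in 0.1 0.5; do
  # Pool reserved for AMs = cluster memory * maximum-am-resource-percent;
  # the integer part is how many AMs fit at once.
  max_ams=$(awk -v c="$cluster_mb" -v a="$am_mb" -v p="$pct" \
    'BEGIN { print int(c * p / a) }')
  echo "maximum-am-resource-percent=$pct -> room for $max_ams concurrent AM(s)"
done
```

With 0.1, the AM pool (819 MB) cannot hold even one 1408 MB application master, so every submission sits in ACCEPTED; with 0.5 (4096 MB), two can run.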
Solution 2
I got this error in this situation:
- MASTER=yarn (or yarn-client)
- spark-submit runs on a computer outside of the cluster, and there is no route from the cluster back to it because it is hidden behind a router
Logs for container_1453825604297_0001_02_000001 (from ResourceManager web UI):
16/01/26 08:30:38 INFO yarn.ApplicationMaster: Waiting for Spark driver to be reachable.
16/01/26 08:31:41 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:44 ERROR yarn.ApplicationMaster: Failed to connect to driver at 192.168.1.180:33074, retrying ...
16/01/26 08:32:45 ERROR yarn.ApplicationMaster: Uncaught exception: org.apache.spark.SparkException: Failed to connect to driver! at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:484)
I worked around it by using YARN cluster mode: MASTER=yarn-cluster.
On another computer that is configured in a similar way, but whose IP is reachable from the cluster, both yarn-client and yarn-cluster work.
Others may encounter this error for different reasons; my point is that checking the error logs (not visible from the terminal, but in the ResourceManager web UI in this case) almost always helps.
Solution 3
This suggests that YARN cannot assign resources to the new application you are submitting. Try to reduce the resources for the container you are asking for (see here), or try this on a less busy cluster.
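As a sketch of the kind of size check involved (the numbers come from the log later in this thread: a 2048 MB per-container cap and a 1 GB driver plus 384 MB overhead; the script itself is illustrative, not part of YARN):

```shell
# Requested AM container = driver memory + memory overhead
driver_mb=1024
overhead_mb=384
requested_mb=$((driver_mb + overhead_mb))
max_container_mb=2048   # per-container cap reported by yarn.Client in the log

if [ "$requested_mb" -le "$max_container_mb" ]; then
  echo "AM request of ${requested_mb} MB fits under the ${max_container_mb} MB cap"
else
  echo "AM request of ${requested_mb} MB exceeds the ${max_container_mb} MB cap"
fi
```

Note that even when a request fits the per-container cap, the queue may simply have no free capacity at the moment, which is why trying a less busy cluster also helps.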
Another thing to try is to check whether YARN is working properly as a service:
sudo service hadoop-yarn-nodemanager status
sudo service hadoop-yarn-resourcemanager status
Solution 4
I had a small cluster where resources were limited (~3 GB per node). I solved this problem by lowering the minimum memory allocation to a sufficiently small number.
From:
yarn.scheduler.minimum-allocation-mb: 1g
yarn.scheduler.increment-allocation-mb: 512m
To:
yarn.scheduler.minimum-allocation-mb: 256m
yarn.scheduler.increment-allocation-mb: 256m
Solution 5
I am on a slightly different setup, using CDH 5.4. I think the cause of this issue on my setup is something getting stuck because of an error (file already exists, etc.): it happens after some other part of my code errors out and I try to fix it and kick it off again.
I can get past this by restarting all services on the cluster in Cloudera Manager, so I agree with the earlier answers that it is probably due to resources still allocated to something that errored out; you need to reclaim those resources to be able to run again, or allocate them differently to begin with.
For example, my cluster has 4 executors available to it. In SparkConf for one process, I set spark.executor.instances to 4. While that process is still running, potentially hung up for some reason, I kick off another job (either the same way, or with spark-submit) with spark.executor.instances set to 1 (--num-executors 1 if using spark-submit). I only have 4 executors, and all 4 are allocated to the earlier process, so the job asking for 1 executor has to wait in line.
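The arithmetic behind that wait, as a tiny sketch (numbers from the example above; the script is illustrative, not how YARN itself is implemented):

```shell
total_executors=4   # what the cluster can provide
in_use=4            # held by the first (possibly hung) job
requested=1         # what the second job asks for (--num-executors 1)

available=$((total_executors - in_use))
if [ "$requested" -gt "$available" ]; then
  # Nothing is free, so the new application stays in ACCEPTED until
  # the first job finishes or is killed and its executors are reclaimed.
  echo "request for $requested executor(s) queued: only $available of $total_executors free"
fi
```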
Sam
Updated on July 23, 2022

Comments
-
Sam almost 2 years
I am running the Kinesis plus Spark application from https://spark.apache.org/docs/1.2.0/streaming-kinesis-integration.html
I am running the below command on an EC2 instance:
./spark/bin/spark-submit --class org.apache.spark.examples.streaming.myclassname --master yarn-cluster --num-executors 2 --driver-memory 1g --executor-memory 1g --executor-cores 1 /home/hadoop/test.jar
I have installed spark on EMR.
EMR details:
- Master instance group: 1 Running, MASTER, m1.medium
- Core instance group: 2 Running, CORE, m1.medium
I am getting the INFO output below, and it never ends.
15/06/14 11:33:23 INFO yarn.Client: Requesting a new application from cluster with 2 NodeManagers
15/06/14 11:33:23 INFO yarn.Client: Verifying our application has not requested more than the maximum memory capability of the cluster (2048 MB per container)
15/06/14 11:33:23 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
15/06/14 11:33:23 INFO yarn.Client: Setting up container launch context for our AM
15/06/14 11:33:23 INFO yarn.Client: Preparing resources for our AM container
15/06/14 11:33:24 INFO yarn.Client: Uploading resource file:/home/hadoop/.versions/spark-1.3.1.e/lib/spark-assembly-1.3.1-hadoop2.4.0.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/spark-assembly-1.3.1-hadoop2.4.0.jar
15/06/14 11:33:29 INFO yarn.Client: Uploading resource file:/home/hadoop/test.jar -> hdfs://172.31.13.68:9000/user/hadoop/.sparkStaging/application_1434263747091_0023/test.jar
15/06/14 11:33:31 INFO yarn.Client: Setting up the launch environment for our AM container
15/06/14 11:33:31 INFO spark.SecurityManager: Changing view acls to: hadoop
15/06/14 11:33:31 INFO spark.SecurityManager: Changing modify acls to: hadoop
15/06/14 11:33:31 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); users with modify permissions: Set(hadoop)
15/06/14 11:33:31 INFO yarn.Client: Submitting application 23 to ResourceManager
15/06/14 11:33:31 INFO impl.YarnClientImpl: Submitted application application_1434263747091_0023
15/06/14 11:33:32 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:32 INFO yarn.Client:
	 client token: N/A
	 diagnostics: N/A
	 ApplicationMaster host: N/A
	 ApplicationMaster RPC port: -1
	 queue: default
	 start time: 1434281611893
	 final status: UNDEFINED
	 tracking URL: http://172.31.13.68:9046/proxy/application_1434263747091_0023/
	 user: hadoop
15/06/14 11:33:33 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:34 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:35 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:36 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:37 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:38 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:39 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:40 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
15/06/14 11:33:41 INFO yarn.Client: Application report for application_1434263747091_0023 (state: ACCEPTED)
Could somebody please let me know why it's not working?
-
marios almost 9 years: This issue seems to persist even if the application has an end (a non-streaming batch job).
-
Michael over 8 years: Just want to add that this is a very important parameter to set if you are running a cluster on a single machine where your resources are small.
-
Abhi almost 8 years: I had to start up my yarn-nodemanager. Thanks!
-
Hendrik F over 7 years: If you are running YARN, you can use yarn application -kill <application-ID>. You can check the status of your jobs on port 8088; if running locally, point your browser to localhost:8088. When I develop something locally, I sometimes have Hive sessions or Zeppelin jobs running as well, and I need to kill those first before the Spark job executes.
-
tokland over 7 years: --conf spark.yarn.executor.memoryOverhead. No value?
-
Keith almost 7 years: Same issue as in your case. It should be a route table issue.
-
frb almost 7 years: Thanks for this explanation!
-
Admin over 6 years: I am working in HUE and I have the same problem. Where do I have to change the above code?
-
Admin over 6 years: I am working in HUE and I have the same problem. Where can I find the capacity-scheduler.xml file?
-
Sangram Gaikwad over 5 years: Worked for me. It's possible to run multiple Spark applications now.
-
user1 almost 4 years: @user5227388 sudo find / -name capacity-scheduler.xml