Error (application returned with exitcode 1) when running Spark in yarn-cluster mode


I was finally able to fix the problem. My bash script that calls spark-submit was passing an invalid argument to it. When a job starts, a generated script called launch_container.sh executes org.apache.spark.deploy.yarn.ApplicationMaster with the arguments passed to spark-submit, and the ApplicationMaster returns with an exit code of 1 when any of its arguments is invalid.
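As a sketch of the kind of wrapper-script bug described above (the variable, flag value, and jar name are made up for illustration): an unquoted, empty variable after a flag like `--conf` expands to nothing, leaving a value-less flag that the ApplicationMaster rejects with exit code 1.

```shell
#!/bin/sh
# Hypothetical wrapper script illustrating the bug class described above.
# If EXTRA_CONF is empty, an unquoted $EXTRA_CONF after --conf expands to
# nothing, leaving a dangling --conf that makes the ApplicationMaster
# exit with code 1. (All names here are illustrative, not from the post.)
EXTRA_CONF=""   # imagine this is populated -- or not -- by the environment

# Buggy version: produces a value-less --conf when EXTRA_CONF is empty:
#   spark-submit --master yarn --deploy-mode cluster --conf $EXTRA_CONF app.jar

# Safer version: append the flag only when there is actually a value.
set -- --master yarn --deploy-mode cluster
[ -n "$EXTRA_CONF" ] && set -- "$@" --conf "$EXTRA_CONF"

# Dry run: print the final command instead of invoking spark-submit,
# so the argument list can be inspected before a real submission.
CMD="spark-submit $* app.jar"
echo "$CMD"
```

With `EXTRA_CONF` empty, the printed command contains no stray `--conf`, which is exactly the failure mode the fix above avoids.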

More information here
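When the Node Manager output alone is not enough, the aggregated YARN logs are the next place to look for the real error. A minimal sketch, assuming log aggregation is enabled on the cluster (the application ID is the one from the logs in the question below):

```shell
#!/bin/sh
# Sketch: fetch the aggregated container logs for the failed application so
# the underlying error (e.g. an invalid ApplicationMaster argument) becomes
# visible. Assumes YARN log aggregation is enabled.
APP_ID="application_1499673023544_0001"

# Printed rather than executed here, since this is only a sketch:
echo "yarn logs -applicationId $APP_ID"
```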

Author: thisisshantzz

Updated on June 30, 2022
Comments

  • thisisshantzz, almost 2 years ago

    I have a Spark job that keeps returning with exit code 1, and I am not able to figure out what this particular exit code means or why the application is returning with it. This is what I see in the Node Manager logs:

    2017-07-10 07:54:03,839 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1499673023544_0001_01_000001 and exit code: 1
    ExitCodeException exitCode=1:
            at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
            at org.apache.hadoop.util.Shell.run(Shell.java:456)
            at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
            at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
            at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
            at java.util.concurrent.FutureTask.run(FutureTask.java:266)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
            at java.lang.Thread.run(Thread.java:745)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exception from container-launch.
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Container id: container_1499673023544_0001_01_000001
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Exit code: 1
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor: Stack trace: ExitCodeException exitCode=1:
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell.run(Shell.java:456)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    2017-07-10 07:54:03,843 INFO org.apache.hadoop.yarn.server.nodemanager.ContainerExecutor:       at java.lang.Thread.run(Thread.java:745)
    2017-07-10 07:54:03,844 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Container exited with a non-zero exit code 1
    2017-07-10 07:54:03,846 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1499673023544_0001_01_000001 transitioned from RUNNING to EXITED_WITH_FAILURE
    2017-07-10 07:54:03,846 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch: Cleaning up container container_1499673023544_0001_01_000001
    

    When I checked the logs of the particular application (and container), I didn't find any specific stack trace or error message. This is what I see in the container's log (stderr) when the job terminates:

    INFO impl.ContainerManagementProtocolProxy: Opening proxy : myplayground:52311
    17/07/10 07:54:02 INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down. myplayground:36322
    17/07/10 07:54:03 INFO cluster.YarnClusterSchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://sparkExecutor@myplayground:49562/user/Executor#509101946]) with ID 1
    17/07/10 07:54:03 INFO cluster.YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
    17/07/10 07:54:03 INFO cluster.YarnClusterScheduler: YarnClusterScheduler.postStartHook done
    17/07/10 07:54:03 ERROR yarn.ApplicationMaster: User application exited with status 1
    17/07/10 07:54:03 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 1, (reason: User application exited with status 1)
    17/07/10 07:54:03 INFO spark.SparkContext: Invoking stop() from shutdown hook
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
    17/07/10 07:54:03 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
    17/07/10 07:54:03 INFO ui.SparkUI: Stopped Spark web UI at http://x.x.x.x:37961
    17/07/10 07:54:03 INFO scheduler.DAGScheduler: Stopping DAGScheduler
    17/07/10 07:54:03 INFO cluster.YarnClusterSchedulerBackend: Shutting down all executors
    17/07/10 07:54:03 INFO cluster.YarnClusterSchedulerBackend: Asking each executor to shut down
    17/07/10 07:54:03 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
    17/07/10 07:54:03 INFO storage.MemoryStore: MemoryStore cleared
    17/07/10 07:54:03 INFO storage.BlockManager: BlockManager stopped
    17/07/10 07:54:03 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
    17/07/10 07:54:03 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
    17/07/10 07:54:03 INFO spark.SparkContext: Successfully stopped SparkContext
    17/07/10 07:54:03 INFO util.ShutdownHookManager: Shutdown hook called
    17/07/10 07:54:03 INFO util.ShutdownHookManager: Deleting directory /tmp/Hadoop-hadoop/nm-local-dir/usercache/myprdusr/appcache/application_1499673023544_0001/spark-2adeda9f-9244-4519-b87f-ec895a50cfcd
    17/07/10 07:54:03 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
    17/07/10 07:54:03 INFO remote.RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
    

    So, in both logs, all I can see is that the application exited with exit code 1. Can anyone tell me what this exit code means and the possible reasons why YARN is throwing this exception?