org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout


You can keep increasing spark.network.timeout until the problem stops occurring, as himanshuIIITian mentioned in a comment.
Timeout exceptions can occur when Spark is under heavy load. If executor memory is low, garbage collection can keep the JVM busy enough to add to that load. Check the logs for an OutOfMemoryError. Enable -XX:+PrintGCDetails -XX:+PrintGCTimeStamps in spark.executor.extraJavaOptions and check the logs for whether a full GC is invoked many times before a task completes. If that is the case, increase executorMemory; that should hopefully solve your problem.
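As a concrete starting point, the relevant entries in spark-defaults.conf might look like this (a sketch; the timeout and memory values are illustrative, not tuned for this cluster):

```
spark.network.timeout            600s
spark.executor.memory            2g
spark.executor.extraJavaOptions  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
```

With the GC flags in place, the executor stderr logs will show each GC pause, so you can see whether full GCs dominate the time before a task finishes.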

Author: Sparknewbie

Updated on December 11, 2020

Comments

  • Sparknewbie
    Sparknewbie over 3 years

    Getting the below error with respect to the container while submitting a Spark application to YARN. The Hadoop (2.7.3)/Spark (2.1) environment is running in pseudo-distributed mode on a single-node cluster. The application works perfectly in local mode, but I hit a roadblock while checking its correctness in cluster mode with YARN as the resource manager. New to this world, hence looking for help.

    --- Application logs

    2017-04-11 07:13:28 INFO  Client:58 - Submitting application 1 to ResourceManager
    2017-04-11 07:13:28 INFO  YarnClientImpl:174 - Submitted application application_1491909036583_0001 to ResourceManager at /0.0.0.0:8032
    2017-04-11 07:13:29 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
    2017-04-11 07:13:29 INFO  Client:58 - 
         client token: N/A
         diagnostics: N/A
         ApplicationMaster host: N/A
         ApplicationMaster RPC port: -1
         queue: default
         start time: 1491909208425
         final status: UNDEFINED
         tracking URL: http://ip-xxx.xx.xx.xxx:8088/proxy/application_1491909036583_0001/
         user: xxxx
    2017-04-11 07:13:30 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
    2017-04-11 07:13:31 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
    2017-04-11 07:13:32 INFO  Client:58 - Application report for application_1491909036583_0001 (state: ACCEPTED)
    2017-04-11 07:17:37 INFO  Client:58 - Application report for application_1491909036583_0001 (state: FAILED)
    2017-04-11 07:17:37 INFO  Client:58 - 
         client token: N/A
         diagnostics: Application application_1491909036583_0001 failed 2 times due to AM Container for appattempt_1491909036583_0001_000002 exited with  exitCode: 10
    For more detailed output, check application tracking page:http://"hostname":8088/cluster/app/application_1491909036583_0001Then, click on links to logs of each attempt.
    Diagnostics: Exception from container-launch.
    Container id: container_1491909036583_0001_02_000001
    Exit code: 10
    Stack trace: ExitCodeException exitCode=10: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
        at org.apache.hadoop.util.Shell.run(Shell.java:479)
        at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
        at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
        at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
    

    --- Container logs

    2017-04-11 07:13:30 INFO  ApplicationMaster:47 - Registered signal handlers for [TERM, HUP, INT]
    2017-04-11 07:13:31 INFO  ApplicationMaster:59 - ApplicationAttemptId: appattempt_1491909036583_0001_000001
    2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing view acls to: root,xxxx
    2017-04-11 07:13:32 INFO  SecurityManager:59 - Changing modify acls to: root,xxxx
    2017-04-11 07:13:32 INFO  SecurityManager:59 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root, xxxx); users with modify permissions: Set(root, xxxx)
    2017-04-11 07:13:32 INFO  Slf4jLogger:80 - Slf4jLogger started
    2017-04-11 07:13:32 INFO  Remoting:74 - Starting remoting
    2017-04-11 07:13:32 INFO  Remoting:74 - Remoting started; listening on addresses :[akka.tcp://[email protected]:45446]
    2017-04-11 07:13:32 INFO  Remoting:74 - Remoting now listens on addresses: [akka.tcp://[email protected]:45446]
    2017-04-11 07:13:32 INFO  Utils:59 - Successfully started service 'sparkYarnAM' on port 45446.
    2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Waiting for Spark driver to be reachable.
    2017-04-11 07:13:32 INFO  ApplicationMaster:59 - Driver now available: xxx.xx.xx.xxx:47503
    2017-04-11 07:15:32 ERROR ApplicationMaster:96 - Uncaught exception: 
    org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout
        at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcEnv.scala:214)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:229)
        at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcEnv.scala:225)
        at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:242)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:98)
        at org.apache.spark.rpc.RpcEnv.setupEndpointRef(RpcEnv.scala:116)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runAMEndpoint(ApplicationMaster.scala:279)
        at org.apache.spark.deploy.yarn.ApplicationMaster.waitForSparkDriver(ApplicationMaster.scala:473)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:315)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:157)
        at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:625)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
        at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
        at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
        at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:623)
        at org.apache.spark.deploy.yarn.ExecutorLauncher$.main(ApplicationMaster.scala:646)
        at org.apache.spark.deploy.yarn.ExecutorLauncher.main(ApplicationMaster.scala)
    Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcEnv.scala:241)
        ... 16 more
    2017-04-11 07:15:32 INFO  ApplicationMaster:59 - Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.lookupTimeout)
    2017-04-11 07:15:32 INFO  ShutdownHookManager:59 - Shutdown hook called
    

    --- YARN NodeManager logs at the time of failure

    2017-04-11 07:15:18,728 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
    2017-04-11 07:15:21,735 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
    2017-04-11 07:15:24,742 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
    2017-04-11 07:15:27,749 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
    2017-04-11 07:15:30,756 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 30015 for container-id container_1491909036583_0001_01_000001: 201.6 MB of 1 GB physical memory used; 2.3 GB of 4 GB virtual memory used
    2017-04-11 07:15:33,018 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1491909036583_0001_01_000001 is : 10
    2017-04-11 07:15:33,019 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1491909036583_0001_01_000001 and exit code: 10
    ExitCodeException exitCode=10: 
        at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
    

    --- SparkContext parameters

    <!-- Spark Configuration -->
    <bean id="sparkInfo" class="SparkInfo">
        <property name="appName" value="framework"></property>
        <property name="master" value="yarn-client"></property>
        <property name="dynamicAllocation" value="false"></property>
        <property name="executorInstances" value="2"></property>
        <property name="executorMemory" value="1g"></property>
        <property name="executorCores" value="4"></property>
        <property name="executorCoresMax" value="2"></property>
        <property name="taskCpus" value="4"></property>
        <property name="executorClassPath" value="/usr/hadoop/hadoop-2.7.3/share/hadoop/yarn/lib/*"></property>
        <property name="yarnJar"
            value="${framework.hdfsURI}/app/spark-1.5.0-bin-hadoop2.6/lib/spark-assembly-1.5.0-hadoop2.6.0.jar"></property>
        <property name="yarnQueue" value="default"></property>
        <property name="memoryFraction" value="0.4"></property>
    </bean>
    

    spark-defaults.conf

    spark.driver.memory              1g
    spark.executor.extraJavaOptions   -XX:ReservedCodeCacheSize=100M -XX:MaxMetaspaceSize=256m -XX:CompressedClassSpaceSize=256m
    spark.rpc.lookupTimeout          600s
    

    yarn-site.xml

    <configuration>
    <!-- Site specific YARN configuration properties -->
      <property>
            <name>yarn.nodemanager.aux-services</name>
            <value>mapreduce_shuffle</value>
      </property>
      <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
      </property>
      <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>3096</value>
      </property>
      <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>3096</value>
      </property>
      <property>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>4</value>
      </property>
    </configuration>
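    For context, here is a rough back-of-the-envelope check of whether the requested containers even fit on this node. It assumes Spark 2.x defaults (per-container overhead of max(384 MB, 10% of heap), a 512 MB yarn-client ApplicationMaster, and requests rounded up to a multiple of yarn.scheduler.minimum-allocation-mb); it is a sketch, not a scheduler simulation:

    ```python
    MIN_ALLOC_MB = 1024   # yarn.scheduler.minimum-allocation-mb
    NODE_MB = 3096        # yarn.nodemanager.resource.memory-mb

    def container_mb(heap_mb):
        """Heap + overhead, rounded up to the YARN allocation increment."""
        overhead = max(384, heap_mb // 10)
        requested = heap_mb + overhead
        # round up to a multiple of the minimum allocation
        return -(-requested // MIN_ALLOC_MB) * MIN_ALLOC_MB

    executor = container_mb(1024)   # executorMemory = 1g -> 2048 MB container
    am = container_mb(512)          # default yarn-client AM -> 1024 MB container
    total = 2 * executor + am       # executorInstances = 2

    print(f"requested {total} MB vs {NODE_MB} MB available")
    ```

    Under those assumptions, two 2048 MB executor containers plus a 1024 MB AM container exceed the 3096 MB the NodeManager can offer, so memory pressure on this single node is a plausible contributing factor alongside the RPC timeout itself.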
    
  • Sparknewbie
    Sparknewbie about 7 years
    The same Spark application works correctly in Spark standalone mode with dynamic allocation enabled. It is only when I switch the master to YARN that I face this issue.
  • Sparknewbie
    Sparknewbie about 7 years
    Also, I have updated the spark.network.timeout parameter to 600s, but the exception still says "Futures timed out after [120 seconds]". The only information I see is: 2017-04-13 15:34:51,370 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exit code from container container_1492111885369_0001_01_000001 is : 10 2017-04-13 15:34:51,371 WARN org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: Exception from container-launch with container ID: container_1492111885369_0001_01_000001 and exit code: 10 ExitCodeException exitCode=10:
  • Sayantan Ghosh
    Sayantan Ghosh about 5 years
    How can I check these configs for an application from the Ambari UI?