Pyspark: TaskMemoryManager: Failed to allocate a page: Need help in Error Analysis


In our case, we had a lot of smaller tables (< 10 MB), so we decided to disable broadcast joins and, in addition, started using G1GC for garbage collection. Add these entries to your spark-defaults.conf file in $SPARK_HOME/conf:

spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC
spark.sql.autoBroadcastJoinThreshold    -1

Alternatively, you can adjust the autoBroadcastJoinThreshold size and see if that solves the issue.

Author: Satya

Updated on November 05, 2020

Comments

  • Satya
Satya, about 3 years ago

    I am facing these errors while running a spark job in standalone cluster mode.

    My Spark job:

    • runs some groupby operations,
    • counts,
    • and joins to get a final df, and then calls df.toPandas().to_csv().
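    A job of the shape described above might look roughly like this (a hypothetical sketch; the table names, column names, and output path are assumptions, not taken from the question):

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Local session only for illustration; the question uses standalone cluster mode.
    spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

    orders = spark.createDataFrame(
        [(1, "a", 10), (1, "b", 20), (2, "a", 30)],
        ["customer_id", "product", "amount"],
    )
    customers = spark.createDataFrame(
        [(1, "Alice"), (2, "Bob")], ["customer_id", "name"]
    )

    # groupby + count, then a join to build the final DataFrame
    counts = orders.groupBy("customer_id").agg(F.count("*").alias("n_orders"))
    final_df = counts.join(customers, on="customer_id", how="inner")

    # toPandas() collects the entire result onto the driver before the CSV is written
    final_df.toPandas().to_csv("final.csv", index=False)
    ```

    Note that the final step is the only one that materializes every row on the driver, which is why it behaves differently from distributed actions.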

    The input dataset is 524 MB. The error I get:

    WARN TaskMemoryManager: Failed to allocate a page (33554432 bytes), try again.

    After the above warning repeats several times, new errors appear:

    1. WARN NettyRpcEnv: Ignored failure: java.util.concurrent.TimeoutException: Cannot receive any reply in 10 seconds

    2. org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [10 seconds]. This timeout is controlled by spark.executor.heartbeatInterval

    3. at org.apache.spark.rpc.RpcTimeout. org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException

    4. ExecutorLostFailure (executor driver exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 158295 ms

    5. Exception happened during processing of request from ('127.0.0.1', 49128) Traceback (most recent call last):

      File "/home/stp/spark-2.0.0-bin-hadoop2.7/python/pyspark/accumulators.py", line 235, in handle
          num_updates = read_int(self.rfile)
      File "/home/stp/spark-2.0.0-bin-hadoop2.7/python/pyspark/serializers.py", line 545, in read_int
          raise EOFError
      EOFError

    6. And finally:

      py4j.protocol.Py4JNetworkError: An error occurred while trying to connect to the Java server (127.0.0.1:38073)

    At first I assumed the error was a memory issue (TaskMemoryManager), but of the 16 GB total, the process was consuming at most 6 GB, leaving 9+ GB free. I had also set the driver memory to 10 GB, so I ruled that out.

    But when I do a count() or show() on my final DataFrame, the operation succeeds. It is only while writing the CSV (toPandas().to_csv()) that it throws the above errors/warnings.

    I don't actually understand what might be causing the issue.

    Please help me analyze the above errors. Any help/comment is welcome. Thanks.