PySpark: java.lang.OutOfMemoryError: Java heap space

Solution 1

After trying out loads of configuration parameters, I found that only one needed to be changed to enable more heap space: spark.driver.memory.

sudo vim $SPARK_HOME/conf/spark-defaults.conf
# uncomment spark.driver.memory and set it to suit your workload; I changed it to the value below
spark.driver.memory 15g
# press Esc, then type :wq! to save and exit vim

Close your existing Spark application and re-run it. You will not encounter this error again. :)
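
To confirm the new value was actually picked up, you can read it back from a fresh session. A minimal sketch (the 15g assumes the edit above):

from pyspark.sql import SparkSession

# Read the setting back from the new session to confirm
# spark-defaults.conf was applied (get() raises if the key is unset)
spark = SparkSession.builder.getOrCreate()
print(spark.sparkContext.getConf().get("spark.driver.memory"))  # should print '15g'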

Solution 2

If you're looking for a way to set this from within a script or a Jupyter notebook, you can do:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()
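
One caveat: spark.driver.memory sizes the driver JVM's heap, which is fixed when that JVM launches, so the setting only takes effect if Spark has not already started in the current process. In a notebook where a session is already running, restart the kernel first. A minimal sketch of guarding against this, assuming PySpark 3.x (where getActiveSession is available):

from pyspark.sql import SparkSession

# An already-running session would silently ignore the memory
# setting below, since its JVM heap was sized at launch
if SparkSession.getActiveSession() is not None:
    raise RuntimeError("Spark already running; restart the kernel before resizing the driver heap")

spark = SparkSession.builder \
    .master('local[*]') \
    .config("spark.driver.memory", "15g") \
    .appName('my-cool-app') \
    .getOrCreate()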

Solution 3

I had the same problem with pyspark (installed with brew). In my case, it was installed at the path /usr/local/Cellar/apache-spark.

The only configuration file I had was at apache-spark/2.4.0/libexec/python/test_coverage/conf/spark-defaults.conf.

As suggested here, I created the file spark-defaults.conf at the path /usr/local/Cellar/apache-spark/2.4.0/libexec/conf/ and appended the line spark.driver.memory 12g to it.
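
The same change can be scripted. A minimal sketch in Python, assuming the Homebrew layout above:

import os

# Append the driver-memory setting to the brew-installed Spark's conf
# (paths assume the Homebrew layout described above)
conf_dir = "/usr/local/Cellar/apache-spark/2.4.0/libexec/conf"
os.makedirs(conf_dir, exist_ok=True)
with open(os.path.join(conf_dir, "spark-defaults.conf"), "a") as f:
    f.write("spark.driver.memory 12g\n")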

Comments

  • pg2455 over 4 years

    I have been using PySpark with IPython lately on my server with 24 CPUs and 32 GB of RAM. It runs on only one machine. In my process, I want to collect a huge amount of data, as given in the code below:

    train_dataRDD = (train.map(lambda x: getTagsAndText(x))
                     .filter(lambda x: x[-1] != [])
                     # Python 3 lambdas cannot unpack tuples, so index into the triple
                     .flatMap(lambda t: [(tag, (t[0], t[1])) for tag in t[2]])
                     .groupByKey()
                     .mapValues(list))
    

    When I do

    training_data =  train_dataRDD.collectAsMap()
    

    It gives me an OutOfMemoryError: Java heap space. Also, I cannot perform any operations on Spark after this error, as it loses the connection with Java. It gives Py4JNetworkError: Cannot connect to the java server.

    It looks like the heap space is small. How can I set it to a bigger limit?

    EDIT:

    Things that I tried before running: sc._conf.set('spark.executor.memory', '32g').set('spark.driver.memory', '32g').set('spark.driver.maxResultSize', '0')

    I changed the Spark options as per the documentation here (if you Ctrl-F and search for spark.executor.extraJavaOptions): http://spark.apache.org/docs/1.2.1/configuration.html

    It says that I can avoid OOMs by setting the spark.executor.memory option. I did the same thing, but it does not seem to be working.
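
    For what it's worth, calling sc._conf.set(...) on an already-created SparkContext has no effect, since configuration is read when the context (and its JVM) starts, and in local mode spark.executor.memory does not help because the driver and executor share a single JVM. A minimal sketch of setting the options before the context is created:

    from pyspark import SparkConf, SparkContext

    # Set memory options *before* creating the SparkContext;
    # mutating sc._conf on a live context changes nothing
    conf = (SparkConf()
            .set('spark.driver.memory', '32g')       # local mode: the driver JVM holds collected data
            .set('spark.driver.maxResultSize', '0')) # 0 disables the cap on results from collect()
    sc = SparkContext(conf=conf)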