Py4JJavaError: An error occurred while calling o88.showString


This is a known issue with pyspark 2.4.0 installed via conda. You'll want to downgrade to pyspark 2.3.0 via the conda prompt or a Linux terminal:

    conda install pyspark=2.3.0
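A quick way to confirm the downgrade took effect (a minimal check, assuming Python imports pyspark from the same environment conda installed into):

    import pyspark

    # Should print 2.3.0 after the downgrade
    print(pyspark.__version__)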
Author: TChi

Updated on July 27, 2022

Comments

  • TChi almost 2 years

    I am new to PySpark. I have been writing my code against a test sample, but once I run it on the larger file (3 GB compressed), I keep getting errors from Py4J. My code only does some filtering and joins.

    Any help would be useful, and appreciated.

    from pyspark.sql import SparkSession
    
    ss = SparkSession \
          .builder \
          .appName("Example") \
          .getOrCreate()
    
    # Enable Arrow-based transfers between the JVM and Python workers
    ss.conf.set("spark.sql.execution.arrow.enabled", 'true')
    
    # directory and filename are defined elsewhere in the script
    df = ss.read.csv(directory + '/' + filename, header=True, sep=",")
    # Some filtering and groupbys...
    df.show()
    

    Returned error:

    Py4JJavaError: An error occurred while calling o88.showString.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in 
    stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 
    1, localhost, executor driver): java.lang.OutOfMemoryError: Java heap space
    ...
    Caused by: java.lang.OutOfMemoryError: Java heap space
    

    UPDATE: I was using py4j 0.10.7 and just updated to 0.10.8.

    UPDATE(1): Adding spark.driver.memory:

    ss = SparkSession \
          .builder \
          .appName("Example") \
          .config("spark.driver.memory", "16g") \
          .getOrCreate()
    

    Summarized error output:

    ERROR:py4j.java_gateway:An error occurred while trying to connect to the Java server (127.0.0.1:38004)
    
    py4j.protocol.Py4JNetworkError: Answer from Java side is empty
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
    py4j.protocol.Py4JNetworkError: Error while receiving
    
    Py4JError
    Py4JError: An error occurred while calling o94.showString
    
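    One likely reason UPDATE(1) changed nothing: in client mode the driver JVM is already running by the time builder.config() is applied, so spark.driver.memory set there can be silently ignored. A minimal sketch of setting it before the gateway launches (assuming a fresh Python process; 16g is an illustrative value):

    import os

    # Must be set before the first SparkSession is created,
    # while no JVM gateway is running yet
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--driver-memory 16g pyspark-shell"

    from pyspark.sql import SparkSession

    ss = SparkSession.builder.appName("Example").getOrCreate()
    # Verify the setting was picked up; expect "16g"
    print(ss.sparkContext.getConf().get("spark.driver.memory"))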

    UPDATE(2): I tried changing the spark-defaults.conf file, as suggested in PySpark: java.lang.OutOfMemoryError: Java heap space. Still getting the error.

    SEMI-SOLVED: This seemed to be a general memory problem. I started a 2xlarge instance with 32 GB of memory, and the program runs with no errors.

    Knowing this, is there another conf option that could help, so I don't have to run such an expensive instance? A sketch of the settings I am considering follows.
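    For reference, a few settings that sometimes avoid the heap blow-up on a single machine; the values below are illustrative, not a tested fix. Note in particular that gzip-compressed CSV is not splittable, so a 3 GB .gz file is read by a single task as one partition:

    from pyspark.sql import SparkSession

    ss = (SparkSession.builder
          .appName("Example")
          # Arrow transfer buffers add heap pressure; try without it first
          .config("spark.sql.execution.arrow.enabled", "false")
          # More, smaller shuffle partitions lower per-task memory (default is 200)
          .config("spark.sql.shuffle.partitions", "400")
          .getOrCreate())

    # Spread the non-splittable gzip input across tasks before the heavy work
    df = (ss.read.csv(directory + '/' + filename, header=True, sep=",")
            .repartition(200))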

    Thanks, everyone.