Write and run PySpark in IntelliJ IDEA


Solution 1

Set the environment variables SPARK_HOME and PYTHONPATH in your program's run/debug configuration.

For instance:

SPARK_HOME=/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4
PYTHONPATH=/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4/python

See the snapshot of this run/debug configuration in IntelliJ IDEA:

[Screenshot: Run/Debug configuration for PySpark]
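
If you would rather not edit the IDE configuration, the same setup can be done at the top of the script itself. A minimal sketch, assuming the install path above and the py4j zip that ships with this Spark release (the exact zip name varies per version):

import os
import sys

# Assumed install location; adjust to where you unpacked Spark.
spark_home = "/Users/<username>/javalibs/spark-1.5.0-bin-hadoop2.4"
os.environ["SPARK_HOME"] = spark_home

# Make the pyspark package and its bundled py4j importable.
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "lib", "py4j-0.8.2.1-src.zip"))

from pyspark import SparkContext, SparkConf  # resolves now that the paths are set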

Solution 2

Set up the SparkContext in your code and run it as a regular Python script. For example, something of this kind:

from pyspark import SparkContext, SparkConf
spark_conf = SparkConf().setAppName("scavenge some logs")
spark_context = SparkContext(conf=spark_conf)
address = "/path/to/the/log/on/hdfs/*.gz"
log = spark_context.textFile(address)

my_result = (log
    # ...here go your actions and transformations...
).saveAsTextFile('my_result')
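
To make the placeholder concrete, here is one assumed pipeline (the 'ERROR' filter and the word count are illustrative, not part of the original answer):

# Keep only error lines, split them into words, and count word frequencies.
my_result = (log
    .filter(lambda line: "ERROR" in line)
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)
).saveAsTextFile('my_result')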
Author: Admin

Updated on June 05, 2022
Comments

  • Admin almost 2 years

    I am trying to work with PySpark in IntelliJ, but I cannot figure out how to correctly install it and set up the project. I can work with Python in IntelliJ, and I can use the PySpark shell, but I cannot tell IntelliJ how to find the Spark files (import pyspark results in "ImportError: No module named pyspark").

    Any tips on how to include/import Spark so that IntelliJ can work with it are appreciated.

    Thanks.

    UPDATE:

    I tried this piece of code:

    from pyspark import SparkContext, SparkConf
    spark_conf = SparkConf().setAppName("scavenge some logs")
    spark_context = SparkContext(conf=spark_conf)
    address = "C:\test.txt"
    log = spark_context.textFile(address)
    
    my_result = log.filter(lambda x: 'foo' in x).saveAsTextFile(r'C:\my_result')
    

    with the following error messages:

    Traceback (most recent call last):
      File "C:/Users/U546816/IdeaProjects/sparktestC/.idea/sparktestfile", line 2, in <module>
        spark_conf = SparkConf().setAppName("scavenge some logs")
      File "C:\Users\U546816\Documents\Spark\lib\spark-assembly-1.3.1-hadoop2.4.0.jar\pyspark\conf.py", line 97, in __init__
      File "C:\Users\U546816\Documents\Spark\lib\spark-assembly-1.3.1-hadoop2.4.0.jar\pyspark\context.py", line 221, in _ensure_initialized
      File "C:\Users\U546816\Documents\Spark\lib\spark-assembly-1.3.1-hadoop2.4.0.jar\pyspark\java_gateway.py", line 35, in launch_gateway
      File "C:\Python27\lib\os.py", line 425, in __getitem__
        return self.data[key.upper()]
    KeyError: 'SPARK_HOME'

    Process finished with exit code 1
    
  • Chris Marotta almost 7 years
    The variables are PYTHONPATH and SPARK_HOME, for those of us behind tyrannical firewalls.
  • Ramesh Maharjan over 6 years
    And SPARK_HOME should point to the directory that contains bin, python, etc., not down into the python directory itself; a quick sanity check is sketched below.
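
Picking up on the two comments above, a quick sanity-check sketch (an assumption for illustration, not from the original thread): a correctly set SPARK_HOME passes both the environment lookup that raised the asker's KeyError and the directory checks below.

import os

spark_home = os.environ["SPARK_HOME"]  # unset -> the KeyError from the question

# The directory should contain bin/ and python/, and python/ should contain pyspark/.
print(sorted(os.listdir(spark_home)))                                # expect 'bin', 'python', ...
print(os.path.isdir(os.path.join(spark_home, "python", "pyspark")))  # expect True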