Importing pyspark in the Python shell


Solution 1

Turns out that the pyspark bin script launches Python and sets up the correct library paths for you. Check out $SPARK_HOME/bin/pyspark :

export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH

I added these lines to my .bashrc file and the modules are now correctly found!
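
If you'd rather not edit .bashrc, a rough equivalent is to put the PySpark sources on sys.path from inside the regular Python shell before importing. A minimal sketch, assuming SPARK_HOME points at your install (the /some/path/to/apache-spark fallback is just a placeholder):

import os
import sys

# Use SPARK_HOME if it is set; otherwise fall back to a placeholder path
spark_home = os.environ.get("SPARK_HOME", "/some/path/to/apache-spark")

# Equivalent of: export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
sys.path.insert(0, os.path.join(spark_home, "python"))

from pyspark import SparkContext  # if this now complains about py4j, see Solution 3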

Solution 2

Assuming one of the following:

  • Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
  • You have run pip install pyspark

Here is a simple method (if you don't care about how it works):

Use findspark

  1. Install findspark, then initialise it from your Python shell

    pip install findspark      # run this from your terminal, not the Python shell
    
    import findspark           # the rest goes in the Python shell
    findspark.init()
    
  2. Import the necessary modules

    from pyspark import SparkContext
    from pyspark import SparkConf
    
  3. Done! (A combined, runnable sketch follows below.)
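
Putting the steps together, a minimal end-to-end sketch: it assumes findspark can locate Spark via SPARK_HOME (or you can pass the install directory explicitly; /path/to/spark is a placeholder), and the app name and local[*] master are illustrative:

import findspark

# Locate Spark via SPARK_HOME, or pass the path explicitly: findspark.init("/path/to/spark")
findspark.init()

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("findspark-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(10)).sum())  # prints 45 if everything is wired up
sc.stop()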

Solution 3

If you see an error like this:

ImportError: No module named py4j.java_gateway

Please add $SPARK_HOME/python/build to PYTHONPATH:

export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
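
On newer Spark releases the python/build directory may not exist; py4j ships as a zip under $SPARK_HOME/python/lib instead, and its name changes with the Spark version. A runtime sketch of the same fix that covers both layouts (assuming SPARK_HOME is set):

import os
import sys
from glob import glob

spark_home = os.environ["SPARK_HOME"]

# Equivalent of the exports above, plus the bundled py4j zip used by newer layouts
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python", "build"))
sys.path.extend(glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")))

from py4j.java_gateway import JavaGateway  # the failing import should now resolve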

Solution 4

Don't run your .py file as python filename.py; instead, use spark-submit filename.py

Source: https://spark.apache.org/docs/latest/submitting-applications.html
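
For example, a minimal script that runs fine under spark-submit but would hit the import error under a bare python (a sketch; the file name and app name are arbitrary):

# filename.py  --  launch with: spark-submit filename.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("spark-submit-demo")
sc = SparkContext(conf=conf)

# A trivial job so the run produces visible output
print(sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x).collect())

sc.stop()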

Solution 5

After exporting the Spark path and the Py4j path, it started working:

export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH 
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH

So, if you don't want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.
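
To confirm that the regular Python shell is now picking up the copies inside your Spark install rather than some other ones, a quick check (a sketch):

import pyspark
import py4j

# Both paths should point somewhere under your Spark installation
print(pyspark.__file__)
print(py4j.__file__)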


Comments

  • Glenn Strycker
    Glenn Strycker about 2 years

    This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)

    I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.

    However, when I run the regular Python shell and try to import pyspark modules, I get this error:

    from pyspark import SparkContext
    

    and it says

    "No module named pyspark".
    

    How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?

  • emmagras
    emmagras over 9 years
    In addition to this step, I also needed to add: export SPARK_HOME=~/dev/spark-1.1.0, go figure. Your foldernames may vary.
  • meyerson
    meyerson over 8 years
    As described in another response stackoverflow.com/questions/26533169/… I had to add the following: export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
  • Alberto Bonsanto
    Alberto Bonsanto over 8 years
    I can't find the libexec directory in my Apache Spark installation, any idea?
  • Dawny33
    Dawny33 over 8 years
    @AlbertoBonsanto Sorry. I haven't faced this issue. So, no idea :(
  • bluerubez
    bluerubez over 8 years
    Yeah they took out the libexec folder in spark 1.5.2
  • OneCricketeer
    OneCricketeer almost 8 years
    @bluerubez Seems to be there in spark 1.6.2... Also, not sure what the libexec/python/build directory is for, but spark 1.6.2 doesn't have that
  • El Dude
    El Dude over 7 years
    Note: I tried unzipping it and using the py4j folder only; that didn't work. Use the zip file...
  • Analytical Monk
    Analytical Monk over 7 years
    The other solutions didn't work for me. I am using findspark for now in my program. Seems like a decent workaround to the problem.
  • WestCoastProjects
    WestCoastProjects over 7 years
    I'd rather not need to do this .. but hey .. given nothing else works .. I'll take it.
  • Mint
    Mint about 5 years
    Can someone expand on why not to do this? I've been looking into this question but so far have not been able to find any that explain why that is.
  • kingledion
    kingledion over 4 years
    @Mint The other answers show why; the pyspark package is not included in $PYTHONPATH by default, so an import pyspark will fail at the command line or in an executed script. You have to either a. run pyspark through spark-submit as intended or b. add $SPARK_HOME/python to $PYTHONPATH.
  • E.ZY.
    E.ZY. over 4 years
    Another point: spark-submit is a shell script that configures the system environment correctly before using Spark; if you just do python main.py, you need to configure the system environment (e.g. PYTHONPATH, SPARK_HOME) yourself.