Importing pyspark in Python shell
Solution 1
It turns out that the pyspark binary is loading Python and automatically setting up the correct library paths. Check out $SPARK_HOME/bin/pyspark:
export SPARK_HOME=/some/path/to/apache-spark
# Add the PySpark classes to the Python path:
export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
I added these lines to my .bashrc file and the modules are now found correctly!
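If you would rather not edit .bashrc, the same path setup can be done from inside Python before importing pyspark. This is a minimal sketch; the fallback path /some/path/to/apache-spark is a placeholder from the answer above, so substitute your actual Spark install location:

```python
import os
import sys

# Placeholder path from the answer above; replace with your real install.
spark_home = os.environ.get("SPARK_HOME", "/some/path/to/apache-spark")

# Mirror the PYTHONPATH export at runtime: put Spark's python dir on sys.path.
pyspark_path = os.path.join(spark_home, "python")
if pyspark_path not in sys.path:
    sys.path.insert(0, pyspark_path)
```

After this runs, `import pyspark` should resolve, provided spark_home points at a real Spark installation.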
Solution 2
Assuming one of the following:
- Spark is downloaded on your system and you have an environment variable SPARK_HOME pointing to it
- You have run pip install pyspark
Here is a simple method (if you don't care about how it works):
Use findspark.
- Install findspark and initialize it in your Python shell:
pip install findspark
import findspark
findspark.init()
- Import the necessary modules:
from pyspark import SparkContext
from pyspark import SparkConf
- Done!
Solution 3
If you get an error such as:
ImportError: No module named py4j.java_gateway
Please add $SPARK_HOME/python/build to PYTHONPATH:
export SPARK_HOME=/Users/pzhang/apps/spark-1.1.0-bin-hadoop2.4
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
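Since the bundled py4j zip filename is version-specific (py4j-0.8.2.1-src.zip above), a hedged alternative is to discover it with a glob at runtime instead of hard-coding the version. A sketch, assuming the standard $SPARK_HOME/python/lib layout; the fallback path here is a placeholder:

```python
import glob
import os
import sys

# Placeholder; point this at your real Spark install.
spark_home = os.environ.get("SPARK_HOME", "/some/path/to/apache-spark")

# The py4j source zip is versioned (e.g. py4j-0.8.2.1-src.zip), so glob for it
# rather than hard-coding a filename that breaks on upgrade.
paths = [os.path.join(spark_home, "python")]
paths += glob.glob(os.path.join(spark_home, "python", "lib", "py4j-*-src.zip"))

for p in paths:
    if p not in sys.path:
        sys.path.insert(0, p)
```

This keeps the equivalent of the PYTHONPATH exports working across Spark upgrades without editing .bashrc each time.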
Solution 4
Don't run your .py file as python filename.py; instead use:
spark-submit filename.py
Source: https://spark.apache.org/docs/latest/submitting-applications.html
Solution 5
After exporting the SPARK_HOME path and the Py4j path, it started to work:
export SPARK_HOME=/usr/local/Cellar/apache-spark/1.5.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
So, if you don't want to type these every time you fire up the Python shell, you might want to add them to your .bashrc file.
Glenn Strycker
Ph.D. Physics 2010 Univ of Michigan. Currently works at ValueClick/Dotomi as a Decision Sciences Analyst.
Updated on January 25, 2022

Comments
- Glenn Strycker, about 2 years ago:
This is a copy of someone else's question on another forum that was never answered, so I thought I'd re-ask it here, as I have the same issue. (See http://geekple.com/blogs/feeds/Xgzu7/posts/351703064084736)
I have Spark installed properly on my machine and am able to run python programs with the pyspark modules without error when using ./bin/pyspark as my python interpreter.
However, when I run the regular Python shell and try to import pyspark modules:
from pyspark import SparkContext
it says:
"No module named pyspark".
How can I fix this? Is there an environment variable I need to set to point Python to the pyspark headers/libraries/etc.? If my spark installation is /spark/, which pyspark paths do I need to include? Or can pyspark programs only be run from the pyspark interpreter?
- emmagras, over 9 years ago: In addition to this step, I also needed to add export SPARK_HOME=~/dev/spark-1.1.0, go figure. Your folder names may vary.
- meyerson, over 8 years ago: As described in another response (stackoverflow.com/questions/26533169/…) I had to add the following: export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
- Alberto Bonsanto, over 8 years ago: I can't find the libexec directory in my Apache Spark installation, any idea?
- Dawny33, over 8 years ago: @AlbertoBonsanto Sorry, I haven't faced this issue, so no idea :(
- bluerubez, over 8 years ago: Yeah, they took out the libexec folder in Spark 1.5.2.
- OneCricketeer, almost 8 years ago: @bluerubez Seems to be there in Spark 1.6.2... Also, not sure what the libexec/python/build directory is for, but Spark 1.6.2 doesn't have that.
- El Dude, over 7 years ago: Note: I tried unzipping it and using the py4j folder only; that didn't work. Use the zip file...
- Analytical Monk, over 7 years ago: The other solutions didn't work for me. I am using findspark for now in my program. Seems like a decent workaround to the problem.
- WestCoastProjects, over 7 years ago: I'd rather not need to do this... but hey, given nothing else works, I'll take it.
- Mint, about 5 years ago: Can someone expand on why not to do this? I've been looking into this question but so far have not been able to find anything that explains why that is.
- kingledion, over 4 years ago: @Mint The other answers show why: the pyspark package is not included in $PYTHONPATH by default, so an import pyspark will fail at the command line or in an executed script. You have to either (a) run pyspark through spark-submit as intended, or (b) add $SPARK_HOME/python to $PYTHONPATH.
- E.ZY., over 4 years ago: Another point: spark-submit is a shell script that configures the system environment correctly before using Spark; if you just do python main.py, you need to configure the environment (e.g. PYTHONPATH, SPARK_HOME) yourself.