How should I integrate Jupyter notebook and pyspark on Ubuntu 12.04?
Solution 1
Just run the command:
PYSPARK_DRIVER_PYTHON="jupyter" PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark
Solution 2
Open the pyspark launcher script with nano or vim and add these two lines:
PYSPARK_DRIVER_PYTHON="jupyter"
PYSPARK_DRIVER_PYTHON_OPTS="notebook"
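If you would rather not edit the pyspark script itself, the same two variables can be set for a single launch from Python. This is only a minimal sketch: the helper names notebook_env and launch_pyspark_notebook are made up for illustration, and it assumes pyspark is on your PATH.

```python
import os
import subprocess


def notebook_env(base_env=None):
    """Return a copy of the environment with the two driver variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Same values as the two lines above.
    env["PYSPARK_DRIVER_PYTHON"] = "jupyter"
    env["PYSPARK_DRIVER_PYTHON_OPTS"] = "notebook"
    return env


def launch_pyspark_notebook(pyspark_cmd="pyspark"):
    """Start pyspark with Jupyter as the driver front end.

    Assumption: `pyspark_cmd` is resolvable on PATH; otherwise pass the
    full path to bin/pyspark inside your Spark directory.
    Blocks until the notebook server exits and returns its exit code.
    """
    return subprocess.call([pyspark_cmd], env=notebook_env())
```

Calling launch_pyspark_notebook() should behave like Solution 1, with the variables scoped to that one process instead of being written into the script.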
Solution 3
EDIT 2017-Oct
With Spark 2.2 and findspark this works well; there is no need for those environment variables:
import findspark
findspark.init('/opt/spark')
import pyspark
sc = pyspark.SparkContext()
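For anyone wondering why this avoids the environment variables: as far as I understand it, findspark.init mostly sets SPARK_HOME and puts Spark's Python directory and its bundled py4j zip on sys.path, so that a plain import pyspark resolves from any notebook. Below is a rough hand-rolled sketch of that idea. It is not findspark's actual source, and add_spark_to_path is a made-up name.

```python
import glob
import os
import sys


def add_spark_to_path(spark_home):
    """Make `import pyspark` resolvable from a Spark install directory."""
    spark_python = os.path.join(spark_home, "python")
    # Spark ships py4j as a zip under python/lib; the version in the
    # file name differs between Spark releases, hence the glob.
    py4j_zips = glob.glob(os.path.join(spark_python, "lib", "py4j-*.zip"))
    if not py4j_zips:
        raise IOError("no py4j zip found under %s" % spark_python)
    os.environ["SPARK_HOME"] = spark_home
    for path in (spark_python, py4j_zips[0]):
        if path not in sys.path:
            sys.path.insert(0, path)
    return spark_python, py4j_zips[0]
```

With a real install you would call add_spark_to_path('/opt/spark') before import pyspark; in practice findspark is the more robust choice.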
OLD
The fastest way I found was to run:
export PYSPARK_DRIVER_PYTHON=ipython
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
pyspark
Or use the equivalent settings for Jupyter. This should open an IPython notebook with pyspark enabled. You might also want to look at Beaker notebook.
Wanderer
Updated on July 15, 2022
Comments
-
Wanderer over 1 year
I am new to PySpark. I installed Anaconda on Ubuntu with "bash Anaconda2-4.0.0-Linux-x86_64.sh" and also installed pyspark. Everything works fine in the terminal, but I want to use it from Jupyter. I created the profile in my Ubuntu terminal as follows:
wanderer@wanderer-VirtualBox:~$ ipython profile create pyspark
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_config.py'
[ProfileCreate] Generating default config file: u'/home/wanderer/.ipython/profile_pyspark/ipython_kernel_config.py'
wanderer@wanderer-VirtualBox:~$ export ANACONDA_ROOT=~/anaconda2
wanderer@wanderer-VirtualBox:~$ export PYSPARK_DRIVER_PYTHON=$ANACONDA_ROOT/bin/ipython
wanderer@wanderer-VirtualBox:~$ export PYSPARK_PYTHON=$ANACONDA_ROOT/bin/python
wanderer@wanderer-VirtualBox:~$ cd spark-1.5.2-bin-hadoop2.6/
wanderer@wanderer-VirtualBox:~/spark-1.5.2-bin-hadoop2.6$ PYTHON_OPTS=”notebook” ./bin/pyspark
Python 2.7.11 |Anaconda 4.0.0 (64-bit)| (default, Dec 6 2015, 18:08:32)
Type "copyright", "credits" or "license" for more information.
IPython 4.1.2 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
16/04/24 15:27:42 INFO SparkContext: Running Spark version 1.5.2
16/04/24 15:27:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/24 15:27:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:33514 with 530.3 MB RAM, BlockManagerId(driver, localhost, 33514)
16/04/24 15:27:53 INFO BlockManagerMaster: Registered BlockManager
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.2
      /_/
Using Python version 2.7.11 (default, Dec 6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.
In [1]: sc
Out[1]: <pyspark.context.SparkContext at 0x7fc96cc6fd10>
In [2]: print sc.version
1.5.2
In [3]:
Below are the versions of jupyter and ipython:
wanderer@wanderer-VirtualBox:~$ jupyter --version
4.1.0
wanderer@wanderer-VirtualBox:~$ ipython --version
4.1.2
I tried to integrate Jupyter notebook and pyspark, but everything I tried failed. I want to work in Jupyter and have no idea how to connect the two.
Can anyone show how to integrate the above components?
-
citynorman over 7 years
Easier still, run on the command line:
IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark
Found here.
-
Neal almost 7 years
IPYTHON_OPTS="notebook" $SPARK_HOME/bin/pyspark appears to be removed in Spark 2.0+.
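Putting the comments and answers together, which variables to set seems to depend on the Spark major version. Here is a small sketch of that choice. The function name notebook_launch_env is made up, and the cutoff at 2.0 is taken from the comment above rather than checked against every release.

```python
def notebook_launch_env(spark_version):
    """Pick the notebook-related variables for a Spark version string.

    Assumption (from the comments above): IPYTHON_OPTS works on 1.x
    and is gone from 2.0 onwards, where the PYSPARK_DRIVER_PYTHON
    pair is used instead.
    """
    major = int(spark_version.split(".")[0])
    if major >= 2:
        return {
            "PYSPARK_DRIVER_PYTHON": "jupyter",
            "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
        }
    return {"IPYTHON_OPTS": "notebook"}
```

For the Spark 1.5.2 install from the question this returns the IPYTHON_OPTS form; for the Spark 2.2 setup mentioned in Solution 3 it returns the PYSPARK_DRIVER_PYTHON pair.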