Create Spark DataFrame from Pandas DataFrame
Import and initialise findspark, create a Spark session, and then use the session object to convert the pandas DataFrame to a Spark DataFrame. Then register the new Spark DataFrame in the catalog as a temporary view. Tested and runs in both Jupyter 5.7.2 and Spyder 3.3.2 with Python 3.6.6.
import findspark
findspark.init()
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
# Create a spark session
spark = SparkSession.builder.getOrCreate()
# Create pandas data frame and convert it to a spark data frame
pandas_df = pd.DataFrame({"Letters":["X", "Y", "Z"]})
spark_df = spark.createDataFrame(pandas_df)
# Add the spark data frame to the catalog
spark_df.createOrReplaceTempView('spark_df')
spark_df.show()
+-------+
|Letters|
+-------+
| X|
| Y|
| Z|
+-------+
spark.catalog.listTables()
[Table(name='spark_df', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]
roldanx
Updated on July 18, 2022

Comments

roldanx, almost 2 years ago:
I'm trying to build a Spark DataFrame from a simple Pandas DataFrame. These are the steps I follow:
import pandas as pd

pandas_df = pd.DataFrame({"Letters": ["X", "Y", "Z"]})
spark_df = sqlContext.createDataFrame(pandas_df)
spark_df.printSchema()
Up to this point everything is OK. The output is:
root
 |-- Letters: string (nullable = true)

The problem comes when I try to print the DataFrame:
spark_df.show()
This is the result:
An error occurred while calling o158.collectToPython. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 5, localhost, executor driver): org.apache.spark.SparkException:
Error from python worker:
Error executing Jupyter command 'pyspark.daemon': [Errno 2] No such file or directory PYTHONPATH was:
/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/pyspark.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/jars/spark-core_2.11-2.4.0.jar:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip:/home/roldanx/soft/spark-2.4.0-bin-hadoop2.7/python/: org.apache.spark.SparkException: No port number in pyspark.daemon's stdout

These are my Spark specifications:
SparkSession - hive
SparkContext
Spark UI
Version: v2.4.0
Master: local[*]
AppName: PySparkShell
These are my environment variables:
export PYSPARK_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
Fact:
As the error mentions, it has to do with running PySpark from Jupyter. Running it with 'PYSPARK_PYTHON=python2.7' or 'PYSPARK_PYTHON=python3.6' works fine.
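The fact above points at the fix: `PYSPARK_PYTHON` is the interpreter the workers launch as `pyspark.daemon`, so it must name a Python executable, while Jupyter belongs in the driver-side variables. A sketch of the corrected environment (assuming `python3.6` and `jupyter` are both on the PATH):

```shell
# Workers run this directly, so it must be a Python interpreter,
# not the jupyter command
export PYSPARK_PYTHON=python3.6

# Jupyter is configured on the driver side instead
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='lab'
```

With this split, the workers can start `pyspark.daemon` normally while the driver still launches inside JupyterLab.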