What is the Spark DataFrame method `toPandas` actually doing?


Solution 1

Using Spark to read a CSV file into pandas is quite a roundabout way of achieving the end goal of reading a CSV file into memory.

It seems like you might be misunderstanding the use cases of the technologies in play here.

Spark is for distributed computing (though it can be used locally). It's generally far too heavyweight to be used for simply reading in a CSV file.

In your example, the sc.textFile method simply gives you a Spark RDD that is effectively a list of text lines. That likely isn't what you want. No type inference is performed, so if you want to sum a column of numbers in your CSV file, you won't be able to: as far as Spark is concerned, they are still strings.
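
A quick illustration (a minimal sketch, assuming a live SparkContext sc and a hypothetical tab-separated file data.csv):

    lines = sc.textFile('data.csv')   # RDD of raw text lines; every element is a str
    lines.first()                     # e.g. '1\t2\t3', still one big string
    # to sum a "numeric" column you have to split and cast by hand:
    total = lines.map(lambda l: float(l.strip().split('\t')[0])).sum()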

Just use pandas.read_csv and read the whole CSV into memory. Pandas will automatically infer the type of each column. Spark doesn't do this.
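
For instance (a sketch using the file name from the question; 'price' is a hypothetical numeric column):

    import pandas as pd

    df = pd.read_csv('tail5.csv', sep='\t')  # tab-separated file, read fully into memory
    df.dtypes                                # per-column types inferred automatically
    df['price'].sum()                        # numeric columns just work, no casting needed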

Now to answer your questions:

Does it store the Pandas object to local memory?

Yes. toPandas() will convert the Spark DataFrame into a Pandas DataFrame, which is of course in memory.
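
For instance (a sketch where ddf is the Spark DataFrame built in the question):

    pdf = ddf.toPandas()   # pulls every row into the driver (master) process
    type(pdf)              # <class 'pandas.core.frame.DataFrame'>: plain, local pandas

Because all rows are collected to the driver, the whole dataset has to fit in that single machine's memory.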

Is Pandas' low-level computation handled entirely by Spark?

No. Pandas runs its own computations; there's no interplay between Spark and pandas, only some API compatibility.

Does it expose all pandas DataFrame functionality?

No. For example, pandas Series objects have an interpolate method that isn't available in PySpark Column objects. There are many methods and functions in the pandas API that are not in the PySpark API.
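
To make that concrete, here is interpolate on a plain pandas Series (a method with no counterpart on a PySpark Column):

    import numpy as np
    import pandas as pd

    s = pd.Series([1.0, np.nan, 3.0])
    s.interpolate()   # fills the gap by linear interpolation: 1.0, 2.0, 3.0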

Can I convert it toPandas and just be done with it, without touching the DataFrame API so much?

Absolutely. In fact, you probably shouldn't even use Spark at all in this case. pandas.read_csv will likely handle your use case unless you're working with a huge amount of data.

Try to solve your problem with simple, low-tech, easy-to-understand libraries, and only go to something more complicated as you need it. Many times, you won't need the more complex technology.

Solution 2

Using a SparkContext or HiveContext method (sc.textFile(), hc.sql()) to read data 'into memory' returns an RDD, but the RDD remains in distributed memory (memory on the worker nodes), not in memory on the master node. All the RDD methods (rdd.map(), rdd.reduceByKey(), etc.) are designed to run in parallel on the worker nodes, with some exceptions. For instance, if you run rdd.collect(), you end up copying the contents of the RDD from all the worker nodes into master-node memory. At that point you lose your distributed compute benefits (though you can still run the RDD methods).
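
A minimal sketch of that distinction, using a throwaway RDD built with sc.parallelize:

    rdd = sc.parallelize(range(1000))        # data lives in distributed (worker) memory
    total = rdd.map(lambda x: x * 2).sum()   # computed in parallel on the workers
    local_list = rdd.collect()               # copies every element into master memory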

Similarly with pandas: when you run toPandas(), you copy the data frame from distributed (worker) memory to local (master) memory and lose most of your distributed compute capabilities. So one possible workflow (that I often use) is to pre-munge your data down to a reasonable size using distributed compute methods, and then convert it to a pandas DataFrame for the rich feature set. Hope that helps.
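
A sketch of that workflow (the table and column names here are made up for illustration):

    # pre-munge at scale with Spark / Hive ...
    big = hc.sql('SELECT user_id, amount FROM events')  # distributed DataFrame
    small = big.filter(big.amount > 0).groupBy('user_id').sum('amount')
    # ... then drop into pandas once the result is a reasonable size
    pdf = small.toPandas()   # result lands in local (master) memory
    pdf.describe()           # full pandas feature set from here on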


Comments

  • Napitupulu Jon, almost 2 years ago

    I'm a beginner with the Spark DataFrame API.

    I use this code to load a tab-separated CSV into a Spark DataFrame:

    # imports assumed by this snippet (plus `sc`/`sqlContext` from the PySpark shell)
    from pyspark.sql.types import StructType, StructField, StringType

    lines = sc.textFile('tail5.csv')                    # RDD of raw text lines
    parts = lines.map(lambda l: l.strip().split('\t'))  # split each line on tabs
    fnames = [...]  # some name list (the column names)
    schemaData = StructType([StructField(fname, StringType(), True) for fname in fnames])
    ddf = sqlContext.createDataFrame(parts, schemaData) # every column typed as string
    

    Suppose I create a DataFrame with Spark from new files and convert it to pandas using the built-in method toPandas():

    • Does it store the Pandas object to local memory?
    • Is Pandas' low-level computation handled entirely by Spark?
    • Does it expose all pandas DataFrame functionality? (I guess yes)
    • Can I convert it toPandas and just be done with it, without touching the DataFrame API so much?
  • Napitupulu Jon, about 9 years ago
    Thank you for answering my questions. Maybe I wasn't being clear enough: I'm a beginner with Spark, and I'm just testing loading from CSV here. I'm required to read data that's too big to handle in memory and do data analysis on it, so the goal is to do the analysis within Hadoop. When I load data from Hadoop (Hive), will converting to pandas load it into local memory?
  • Napitupulu Jon, about 9 years ago
    And I'm not using Hadoop on a single machine; I may have to load data with Hive from HDFS. If I convert it to pandas, can I still run pandas within the distributed system?
  • Phillip Cloud, about 9 years ago
    Ah, I see. Spark DataFrames and pandas DataFrames share no computational infrastructure. Spark DataFrames emulate the API of pandas DataFrames where it makes sense. If you're looking for something that lets you operate in a pandas-like way on the Hadoop ecosystem and additionally lets you pull results into memory as a pandas DataFrame, check out blaze.
  • fanfabbb, almost 9 years ago
    Apart from blaze, sparklingpandas also aims to provide a pandas-like API on Spark DataFrames: github.com/sparklingpandas/sparklingpandas
  • luthfianto, almost 9 years ago
    Can I read the CSV into a pandas DataFrame first and then convert it to a Spark DataFrame?
  • Phillip Cloud, almost 9 years ago
    Yes, you can pass a pandas DataFrame to HiveContext.createDataFrame (see the sketch after this thread).
  • Matthias, almost 8 years ago
    If I'm not mistaken, the Spark DataFrame is not local, meaning that (depending on the size of the file) several compute nodes each load part of the file and therefore hold only part of the DataFrame. Map and filter functions then run only on that part of the data. To gather the DataFrame onto one local machine you need to use collect. toPandas seems to do the same: collect the data and convert it to a local pandas DataFrame.
  • TheProletariat, almost 6 years ago
    Hey @PhillipCloud, would you consider modifying your answer to not include the top part, which answers a different question than the OP asked, and also clarifying 'in memory' to differentiate between local (master) memory and distributed (worker) memory? Thanks!
  • Lawrence, about 4 years ago
    When you say a "huge" amount of data, how big are we talking? Although I'm not running into limits with pandas reading files, the read time is quite long (10-20 seconds). Would Spark help in this instance?
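
As a footnote to the thread above, the reverse conversion Phillip Cloud mentions might look like this (a sketch, assuming a HiveContext named hc as in Solution 2):

    import pandas as pd

    pdf = pd.read_csv('tail5.csv', sep='\t')  # read locally with pandas first
    sdf = hc.createDataFrame(pdf)             # distribute it as a Spark DataFrame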