Take n rows from a spark dataframe and pass to toPandas()


Solution 1

Use limit(n). It returns a new Spark DataFrame with at most n rows, so you can still call toPandas() on the result:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()

Or apply the limit after adding the column:

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()

Solution 2

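You can also fetch the first rows of the Spark DataFrame with head(n), which returns a list of Row objects, and build a pandas DataFrame from them:

import pandas as pd

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])

# head(3) collects the first 3 rows to the driver as a list of Row objects
df_pandas = pd.DataFrame(df.head(3), columns=df.columns)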

In [4]: df_pandas
Out[4]: 
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
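
The head-based approach can be wrapped in a small helper. This is a minimal sketch, not part of the original answer: the name head_to_pandas is hypothetical, and it assumes df is a Spark DataFrame whose head(n) returns a list of tuple-like Row objects. Because head is an action, all n rows are collected to the driver, so it is best kept for small n.

```python
import pandas as pd


def head_to_pandas(df, n):
    """Collect the first n rows of a Spark DataFrame into a pandas DataFrame.

    head_to_pandas is a hypothetical helper name used for illustration.
    """
    # head(n) is an action: it runs the job and returns a list of Row objects.
    rows = df.head(n)
    # Row objects behave like tuples, so pandas can build the frame directly;
    # the column names are taken from the Spark DataFrame.
    return pd.DataFrame(rows, columns=df.columns)
```

With the df defined above, head_to_pandas(df, 3) should produce the same frame as df_pandas. Always pass n explicitly here: head() with no argument returns a single Row rather than a list.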
Author by

jamiet

Updated on July 09, 2020

Comments

  • jamiet
    jamiet almost 4 years

    I have this code:

    l = [('Alice', 1),('Jim',2),('Sandra',3)]
    df = sqlContext.createDataFrame(l, ['name', 'age'])
    df.withColumn('age2', df.age + 2).toPandas()
    

    Works fine, does what it needs to. Suppose though I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe and thus I can't pass it to toPandas().

    So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? Can't think this is difficult but I can't figure it out.

    I'm using Spark 1.6.0.

  • jamiet
    jamiet over 6 years
    is there a significant difference between head() and limit()?
  • Anton Protopopov
    Anton Protopopov over 6 years
    @jamiet head returns the first n rows as a list of Row objects, like take, while limit returns a new Spark DataFrame restricted to the specified number of rows. In this case limit is probably more appropriate.
  • jamiet
    jamiet over 6 years
    ah, easy. So limit() is a transformation, head() is an action. Thanks Anton.
  • Karan Sharma
    Karan Sharma over 3 years
    It is not safe to assume that rerunning df.limit(2) will always return the same rows; the result is not deterministic. I tried this and got stuck in hours of debugging.
  • haneulkim
    haneulkim over 2 years
    @KaranSharma does that mean when we use limit(n) we are randomly selecting n rows instead of returning top n rows?
  • Karan Sharma
    Karan Sharma over 2 years
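    @haneulkim Not exactly random, but you are right that they are not guaranteed to be the top n. Without an explicit ordering, limit returns an arbitrary n rows, and which ones you get can differ between runs.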
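
As the comments above point out, limit(n) on its own does not guarantee which rows come back. One common way to make "top n" well defined is to sort before limiting. The sketch below is an illustration, not something from the original answers: top_n_to_pandas is a hypothetical helper name, and it assumes df is a Spark DataFrame and that the sort columns identify rows uniquely (ties can still be returned in any order).

```python
def top_n_to_pandas(df, n, *order_cols):
    """Return the first n rows of a Spark DataFrame as a pandas DataFrame.

    top_n_to_pandas is a hypothetical helper name used for illustration.
    If order_cols are given, the rows are sorted by them first, so that
    "first n" is well defined. Without them, Spark may return any n rows.
    """
    if order_cols:
        # orderBy is a transformation; nothing is computed yet.
        df = df.orderBy(*order_cols)
    # limit is also a transformation; toPandas() is the action that
    # runs the job and brings at most n rows to the driver.
    return df.limit(n).toPandas()
```

For the example in the question, this could be called as top_n_to_pandas(df.withColumn('age2', df.age + 2), 2, 'age'), which sorts by age before taking two rows.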