Take n rows from a spark dataframe and pass to toPandas()
Solution 1
You can use the limit(n) function:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.limit(2).withColumn('age2', df.age + 2).toPandas()
Or:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).limit(2).toPandas()
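Note that limit(n) on an unordered DataFrame does not guarantee which n rows come back, so repeated runs may differ. A minimal sketch of one way to make "the first n rows" well defined is to sort before limiting. The helper name top_n_to_pandas and the demo function are hypothetical, not part of Spark, and the sketch assumes the ordering column has no ties. The demo uses SparkSession, which needs Spark 2.0 or later; on Spark 1.6 the sqlContext from the examples above would take its place.

```python
def top_n_to_pandas(df, n, order_col):
    """Return the first n rows of a Spark DataFrame, ordered by order_col,
    as a pandas DataFrame. (Hypothetical helper, for illustration only.)"""
    # orderBy fixes which rows count as "first"; limit(n) is a transformation,
    # so only n rows are collected to the driver when toPandas() runs.
    return df.orderBy(order_col).limit(n).toPandas()


def demo():
    # Assumption: pyspark and pandas are installed and a local Spark can start.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("limit-demo").getOrCreate()
    l = [('Alice', 1), ('Jim', 2), ('Sandra', 3)]
    df = spark.createDataFrame(l, ['name', 'age'])
    print(top_n_to_pandas(df, 2, 'age'))
    spark.stop()
```

Calling demo() should print the two rows with the smallest age. If the ordering column can contain duplicates, adding a second column to orderBy as a tie-breaker keeps the result stable.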
Solution 2
You could get the first rows of the Spark DataFrame with head and then create a pandas DataFrame:
import pandas as pd

l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df_pandas = pd.DataFrame(df.head(3), columns=df.columns)
In [4]: df_pandas
Out[4]:
     name  age
0   Alice    1
1     Jim    2
2  Sandra    3
Author by
jamiet
Updated on July 09, 2020
Comments
-
jamiet almost 4 years
I have this code:
l = [('Alice', 1),('Jim',2),('Sandra',3)]
df = sqlContext.createDataFrame(l, ['name', 'age'])
df.withColumn('age2', df.age + 2).toPandas()
Works fine, does what it needs to. Suppose though I only want to display the first n rows, and then call toPandas() to return a pandas dataframe. How do I do it? I can't call take(n) because that doesn't return a dataframe and thus I can't pass it to toPandas().
So to put it another way, how can I take the top n rows from a dataframe and call toPandas() on the resulting dataframe? Can't think this is difficult but I can't figure it out.
I'm using Spark 1.6.0.
-
jamiet over 6 years
Is there a significant difference between head() and limit()?
-
Anton Protopopov over 6 years
@jamiet head returns the first n rows, like take, and limit limits the resulting Spark DataFrame to a specified number of rows. Probably in that case limit is more appropriate.
-
jamiet over 6 years
Ah, easy. So limit() is a transformation, head() is an action. Thanks Anton.
-
Karan Sharma over 3 years
It is not safe to assume that rerunning dataframe.limit(2) will always return the same result (it's not deterministic). I tried this and got stuck in hours of debugging.
-
haneulkim over 2 years
@KaranSharma Does that mean when we use limit(n) we are randomly selecting n rows instead of returning the top n rows?
-
Karan Sharma over 2 years
@haneulkim Yes, you are right. Limit randomly selects the rows it wants to.