How to get the first and last value from a DataFrame column in PySpark?

16,947

Solution 1

You may use collect, but performance will be terrible because the driver collects all the data just to keep the first and last items. Worse, on a big DataFrame it will most likely cause an OOM error and not work at all.

Another idea would be to use agg with the first and last aggregation functions. This is not reliable: both functions are non-deterministic, because the reducers do not necessarily receive the records in the order of the DataFrame.

Spark offers a head function, which makes getting the first element very easy. Before Spark 3.0 there was no counterpart for the last element (DataFrame.tail was added in 3.0). A straightforward approach that works on any version is to sort the DataFrame in reverse and use head again.

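import pyspark.sql.functions as F
# head() returns the first Row, or None if the DataFrame is empty
first = df.head().support
# reverse the current row order, then take the new first Row
last = df.orderBy(F.monotonically_increasing_id().desc()).head().support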
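On Spark 3.0 and later, the sort can be skipped because DataFrame.tail(n) returns the last n rows as a list of Row objects. The following is a minimal sketch under that version assumption; the helper name first_and_last is hypothetical and not part of PySpark.

```python
def first_and_last(df, column):
    """Return (first, last) values of `column`, or None if `df` is empty.

    Hypothetical helper; assumes Spark 3.0+ so that DataFrame.tail exists.
    """
    first_row = df.head()  # a single Row, or None when the DataFrame is empty
    if first_row is None:
        return None
    last_row = df.tail(1)[0]  # tail(1) returns a list holding the last Row
    return first_row[column], last_row[column]
```

With the DataFrame from the question, this would be called as first_and_last(df, "support"). Like head, tail(1) only brings a single row back to the driver.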

Finally, since it is a shame to sort a dataframe simply to get its first and last elements, we can use the RDD API and zipWithIndex to index the dataframe and only keep the first and the last elements.

size = df.count()
# zipWithIndex pairs each Row with its position: (Row, index)
df.rdd.zipWithIndex()\
  .filter(lambda x: x[1] == 0 or x[1] == size - 1)\
  .map(lambda x: x[0].support)\
  .collect()
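The same zipWithIndex idea can be wrapped in a small function that also copes with an empty DataFrame. This is only a sketch: the name first_and_last_by_index is hypothetical, and it still runs count() as a separate job, as the snippet above does.

```python
def first_and_last_by_index(df, column):
    """Return [first, last] values of `column` using zipWithIndex.

    Hypothetical helper, not part of PySpark. A single-row DataFrame
    yields a one-element list; an empty one yields [].
    """
    size = df.count()
    if size == 0:
        return []  # nothing to index
    wanted = {0, size - 1}  # indexes of the first and last rows
    pairs = (
        df.rdd.zipWithIndex()  # pairs of (Row, index)
        .filter(lambda pair: pair[1] in wanted)
        .map(lambda pair: (pair[1], pair[0][column]))
        .collect()
    )
    # Sort by index so the first value always comes before the last one.
    return [value for _, value in sorted(pairs)]
```

Only the matching rows (at most two) are sent to the driver, so this stays safe on large DataFrames.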

Solution 2

You can try indexing the collected rows; see the example below:

df = <your dataframe>
rows = df.collect()  # collect once; this brings every row to the driver
first_record = rows[0]
last_record = rows[-1]

EDIT: You have to pass the column name as well.

df = <your dataframe>
rows = df.collect()  # collect once; this brings every row to the driver
first_record = rows[0]['column_name']
last_record = rows[-1]['column_name']
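If collect is acceptable for a small DataFrame, selecting the column first keeps the other columns off the driver. A rough sketch; the helper name first_and_last_collected is hypothetical, and every row of the selected column is still transferred.

```python
def first_and_last_collected(df, column):
    """Return (first, last) values of `column` via a single collect().

    Hypothetical helper; only suitable for DataFrames that fit in
    driver memory, since all rows of `column` are collected.
    """
    rows = df.select(column).collect()  # list of one-column Row objects
    if not rows:
        return None  # empty DataFrame
    return rows[0][column], rows[-1][column]
```

For the question's data this would be called as first_and_last_collected(df, "support").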
Author by

Sai

Updated on July 18, 2022

Comments

  • Sai
    Sai almost 2 years

    I have a DataFrame and want to get the first and last values from one of its columns.

    +----+-----+--------------------+
    |test|count|             support|
    +----+-----+--------------------+
    |   A|    5| 0.23809523809523808|
    |   B|    5| 0.23809523809523808|
    |   C|    4| 0.19047619047619047|
    |   G|    2| 0.09523809523809523|
    |   K|    2| 0.09523809523809523|
    |   D|    1|0.047619047619047616|
    +----+-----+--------------------+
    

    The expected output is the first and last values of the support column, i.e. x = [0.23809523809523808, 0.047619047619047616]