PySpark RDD: collect first 163 rows


It is not very efficient, but you can zipWithIndex and filter; keys() then drops the indices and keeps the original elements:

rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

sc.parallelize(rdd.take(163))
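
For reference, here is a minimal self-contained sketch of both approaches; the local SparkContext setup and the range(1000) example data are assumptions for illustration only:

from pyspark import SparkContext

sc = SparkContext("local[*]", "first-n-rows")  # assumed local setup for the demo
rdd = sc.parallelize(range(1000))  # stand-in for your actual RDD

n = 163

# Approach 1: pair each element with its index, keep the first n,
# then drop the indices with keys(); the result stays an RDD.
first_n = rdd.zipWithIndex().filter(lambda vi: vi[1] < n).keys()

# Approach 2: take(n) pulls a Python list to the driver and
# parallelize ships it back out as a new RDD; fine when n is small.
first_n_alt = sc.parallelize(rdd.take(n))

print(first_n.count(), first_n_alt.count())  # 163 163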

Comments

  • wheels

    Is there a way to get the first 163 rows of an rdd without converting to a df?

    I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole rdd.

    Is there a way to do this? Or, if not, is there a way to convert a list into an rdd?