PySpark RDD: collect first 163 rows


It is not very efficient, but you can zipWithIndex and filter; keys() then drops the indices and keeps the original elements:

rdd.zipWithIndex().filter(lambda vi: vi[1] < 163).keys()

In practice it makes more sense to simply take and parallelize:

sc.parallelize(rdd.take(163))
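
For reference, here is a minimal self-contained sketch of both approaches; the local SparkContext setup and the range(1000) example data are assumptions for illustration only:

from pyspark import SparkContext

sc = SparkContext("local[*]", "first-n-rows")  # assumed local setup for the demo
rdd = sc.parallelize(range(1000))  # stand-in for your actual RDD

n = 163

# Approach 1: pair each element with its index, keep the first n,
# then drop the indices with keys(); the result stays an RDD.
first_n = rdd.zipWithIndex().filter(lambda vi: vi[1] < n).keys()

# Approach 2: take(n) pulls a Python list to the driver and
# parallelize ships it back out as a new RDD; fine when n is small.
first_n_alt = sc.parallelize(rdd.take(n))

print(first_n.count(), first_n_alt.count())  # 163 163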

Comments

  • wheels

    Is there a way to get the first 163 rows of an rdd without converting to a df?

    I've tried something like newrdd = rdd.take(163), but that returns a list, and rdd.collect() returns the whole rdd.

    Is there a way to do this? Or, if not, is there a way to convert a list into an rdd?