need instance of RDD but returned class 'pyspark.rdd.PipelinedRDD'

12,585

pyspark.rdd.PipelinedRDD is a subclass of RDD and it must have all the API's defined in the RDD. ie. PipelinedRDD is just a special type of RDD which is created when you run a map function on an RDD.

for example, take a look at the below snippet.

>>> rdd = spark.sparkContext.parallelize(range(1,10))
>>> type(rdd)
<class 'pyspark.rdd.RDD'> ## the type is RDD here
>>> rdd = rdd.map(lambda x: x * x)
>>> type(rdd)
<class 'pyspark.rdd.PipelinedRDD'> ## after the map operation the type is changed to pyspark.rdd.PipelinedRDD

so you should just treat your pyspark.rdd.PipelinedRDD just as an RDD in your code.

There is no complete casting support in Python as it is a dynamically typed language. to forcefully convert your pyspark.rdd.PipelinedRDD to a normal RDD you can collect on rdd and parallelize it back

>>> rdd = spark.sparkContext.parallelize(rdd.collect())
>>> type(rdd)
<class 'pyspark.rdd.RDD'>

Running collect on an RDD may cause MemoryError if the RDD's data is large.

Share:
12,585
Oscar C.
Author by

Oscar C.

I love Internet of The Things (I.O.T) and all the technologies related to this topic, big data, electronics, systems architecture and cloud computing. I am also interested in the integration of systems in mobile applications, especially in Android.

Updated on August 09, 2022

Comments

  • Oscar C.
    Oscar C. over 1 year

    Hi I have this code in Notebooks and traying to code python spark:

     mydataNoSQL.createOrReplaceTempView("mytable")
     spark.sql("SELECT * from mytable")
     return mydataNoSQL
    
    def getsameData(df,spark):
    result = spark.sql("select * from mytable where temeperature is not null")
    return result.rdd.sample(False, 0.1).map(lambda row : (row.temperature))
    

    I need an instance RDD but I am geting an class 'pyspark.rdd.PipelinedRDD'

    Any help will be wellcome.

  • Oscar C.
    Oscar C. almost 7 years
    From what I understand from your answer, it could be used as a RDD in next instruction, it will not cause any problem, it is jsut same as RDD. buy anyway, is there any way to have it back to a upper class (RDD)?
  • rogue-one
    rogue-one almost 7 years
    @OscarC. yes, you should treat it as an RDD. why do you want to cast it to RDD?. Since Python is a dynamically typed language you would never have to bother about casting the type of the RDD.
  • Oscar C.
    Oscar C. almost 7 years
    I need in RDD since i must to delivery an exercise in my IOT Course. The exercise is evaluated by a software, a grade, not by a person. So I have not choice to explain to this software what you and me knows, For a real case I know it is not need to cast, but this sofware does not undersand it,..
  • rogue-one
    rogue-one almost 7 years
    @OscarC updated the answer to convert it back to RDD.
  • Oscar C.
    Oscar C. almost 7 years
    Thanks a lot. I remember when exams and exercises were reviwed by humans...
  • DachuanZhao
    DachuanZhao over 2 years
    Can you give a solution based on @OscarC. 's code ?