Spark & Scala - Cannot Filter null Values from RDD


Solution 1

Ratings.filter(x => x._1 != null)

This line does transform the RDD, but you never use the RDD it returns, so the original Ratings is printed unfiltered. You can chain the call instead:

Ratings.filter(_._1 != null).foreach(println)
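A related idiom, sketched here on a plain Scala list rather than the asker's HBase-backed RDD (the sample data is made up to mimic the question's output), is to wrap the possibly-null column value in Option, so null rows are dropped without an explicit != null check:

```scala
// Hypothetical sample mimicking the (user, item, rating) tuples;
// some rows came back entirely null, as in the question's output.
val ratings: List[(String, String, String)] = List(
  ("3359", "1494", "4"),
  (null, null, null),
  ("28574", "1542", "5")
)

// Option(x) is None when x is null, so flatMap keeps only rows
// with a non-null user column.
val cleaned = ratings.flatMap { case (u, i, r) =>
  Option(u).map(user => (user, i, r))
}

cleaned.foreach(println)
```

The same flatMap-over-Option pattern works on an RDD, since RDD also has flatMap.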

Solution 2

RDDs are immutable objects - any transformation on an RDD doesn't change that original RDD, but rather produces a new one. So - you should use the RDD returned from filter (just like you do with the result of map) if you want to see the effect of filter:

val result = Ratings.filter ( x => x._1 != null )
result.foreach(println)
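The same immutability rule holds for ordinary Scala collections, which makes it easy to verify locally without a Spark cluster (a minimal sketch with made-up data, not the asker's HBase pipeline):

```scala
// filter never mutates the receiver; it returns a new collection.
val ratings = List(("3359", "4"), (null, "x"), ("28574", "5"))

ratings.filter(_._1 != null)               // result discarded: ratings is unchanged
val result = ratings.filter(_._1 != null)  // result captured: this is the filtered data

println(ratings.length)  // still 3 — the original is untouched
println(result.length)   // 2 — the null row is gone
```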
Author by questionasker

Updated on June 29, 2022

Comments

  • questionasker, almost 2 years

    I tried to filter null values from an RDD but failed. Here's my code:

    val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
          classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
          classOf[org.apache.hadoop.hbase.client.Result])
    
    val raw_hbaseRDD = hBaseRDD.map{
      kv => kv._2
    }
    
    val Ratings = raw_hbaseRDD.map {
          result =>  val x = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("user")))
                     val y = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("item")))
                     val z = Bytes.toString(result.getValue(Bytes.toBytes("data"),Bytes.toBytes("rating")))
    
                     (x,y, z)
        }
    Ratings.filter ( x => x._1 != null )
    
    Ratings.foreach(println)
    

    When debugging, null values still appeared after the filter:

    (3359,1494,4)
    (null,null,null)
    (28574,1542,5)
    (null,null,null)
    (12062,1219,5)
    (14068,1459,3)
    

    Any better ideas?