Filter by array value in Spark DataFrame

11,844

In Spark 1.5+ you can use array_contains function:

df.where(array_contains($"people.artist.id", "153"))

If you use an earlier version you can try an UDF like this:

val containsId = udf(
  (rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))
Share:
11,844
zt1983811
Author by

zt1983811

Software developer

Updated on June 04, 2022

Comments

  • zt1983811
    zt1983811 almost 2 years

    I am using apache spark 1.5 dataframe with elasticsearch, I am try to filter id from a column that contains a list(array) of ids.

    For example the mapping of elasticsearch column is looks like this:

        {
            "people":{
                "properties":{
                    "artist":{
                       "properties":{
                          "id":{
                             "index":"not_analyzed",
                             "type":"string"
                           },
                           "name":{
                              "type":"string",
                              "index":"not_analyzed",
                           }
                       }
                   }
              }
        }
    

    The example data format will be like following

    {
        "people": {
            "artist": {
                [
                      {
                           "id": "153",
                           "name": "Tom"
                      },
                      {
                           "id": "15389",
                           "name": "Cok"
                      }
                ]
            }
        }
    },
    {
        "people": {
            "artist": {
                [
                      {
                           "id": "369",
                           "name": "Carl"
                      },
                      {
                           "id": "15389",
                           "name": "Cok"
                      },
                     {
                           "id": "698",
                           "name": "Sol"
                      }
                ]
            }
        }
    }
    

    In spark I try this:

    val peopleId  = 152
    val dataFrame = sqlContext.read
         .format("org.elasticsearch.spark.sql")
         .load("index/type")
    
    dataFrame.filter(dataFrame("people.artist.id").contains(peopleId))
        .select("people_sequence.artist.id")
    

    I got all the id that is contains 152, for example 1523 , 152978 but not only id == 152

    Then I tried

    dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId))
        .select("people.artist.id")
    

    I get empty, I understand why, it's because I have array of people.artist.id

    Can anyone tell me how to filter when I have list of ids ?