Filter by array value in Spark DataFrame
11,844
In Spark 1.5+ you can use array_contains
function:
df.where(array_contains($"people.artist.id", "153"))
If you use an earlier version you can try an UDF like this:
val containsId = udf(
(rs: Seq[Row], v: String) => rs.map(_.getAs[String]("id")).exists(_ == v))
df.where(containsId($"people.artist", lit("153")))
Comments
-
zt1983811 almost 2 years
I am using apache spark 1.5 dataframe with elasticsearch, I am try to filter id from a column that contains a list(array) of ids.
For example the mapping of elasticsearch column is looks like this:
{ "people":{ "properties":{ "artist":{ "properties":{ "id":{ "index":"not_analyzed", "type":"string" }, "name":{ "type":"string", "index":"not_analyzed", } } } } }
The example data format will be like following
{ "people": { "artist": { [ { "id": "153", "name": "Tom" }, { "id": "15389", "name": "Cok" } ] } } }, { "people": { "artist": { [ { "id": "369", "name": "Carl" }, { "id": "15389", "name": "Cok" }, { "id": "698", "name": "Sol" } ] } } }
In spark I try this:
val peopleId = 152 val dataFrame = sqlContext.read .format("org.elasticsearch.spark.sql") .load("index/type") dataFrame.filter(dataFrame("people.artist.id").contains(peopleId)) .select("people_sequence.artist.id")
I got all the id that is contains 152, for example 1523 , 152978 but not only id == 152
Then I tried
dataFrame.filter(dataFrame("people.artist.id").equalTo(peopleId)) .select("people.artist.id")
I get empty, I understand why, it's because I have array of people.artist.id
Can anyone tell me how to filter when I have list of ids ?