Unpacking a list to select multiple columns from a Spark data frame

81,668

Solution 1

Use df.select(cols.head, cols.tail: _*)

Let me know if it works :)

Explanation from @Ben:

The key is the method signature of select:

select(col: String, cols: String*)

The cols: String* parameter takes a variable number of arguments, and : _* unpacks a sequence so that it can be passed to such a parameter. This is very similar to unpacking in Python with *args.
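
To see the mechanics without a Spark session, here is a minimal plain-Scala sketch. The function pick is a hypothetical stand-in that only mirrors the shape of select(col: String, cols: String*); it is not part of Spark.

```scala
// Hypothetical stand-in for DataFrame.select(col: String, cols: String*).
object VarargsDemo {
  // Joins the names it receives so the result is easy to inspect.
  def pick(col: String, cols: String*): String =
    (col +: cols).mkString(", ")

  def main(args: Array[String]): Unit = {
    val cols = List("b", "c")

    // Passing the names one by one works, as with df.select("b", "c").
    println(pick("b", "c")) // prints "b, c"

    // pick(cols) would not compile: a List[String] is not a String.
    // `: _*` expands the sequence into the repeated parameter instead.
    println(pick(cols.head, cols.tail: _*)) // prints "b, c"
  }
}
```

The first name has to be passed separately because this overload requires at least one String before the repeated parameter.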

Solution 2

You can convert each String to a Spark Column like this:

import org.apache.spark.sql.functions._
df.select(cols.map(col): _*)
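
The variants so far can be tried side by side. The following is a sketch, assuming a local SparkSession and a toy one-row frame with columns a to d as in the question; the object name, the run helper, and the app name are made up for illustration. The third variant, df(_), is the shorter form suggested in the comments.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SelectColumnsDemo {
  // Builds a toy frame and returns the column names each variant selects.
  def run(spark: SparkSession): Seq[Seq[String]] = {
    import spark.implicits._

    val df = Seq((1, 2, 3, 4)).toDF("a", "b", "c", "d")
    val cols = List("b", "c")

    // Overload select(col: String, cols: String*): first name, then the rest.
    val byName = df.select(cols.head, cols.tail: _*)

    // Overload select(cols: Column*): map every name to a Column first.
    val byColumn = df.select(cols.map(col): _*)

    // Same overload, resolving each Column through the frame itself.
    val byApply = df.select(cols.map(df(_)): _*)

    Seq(byName, byColumn, byApply).map(_.columns.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("select-columns-demo")
      .getOrCreate()

    run(spark).foreach(names => println(names.mkString(", ")))
    spark.stop()
  }
}
```

All three calls should select the same two columns; the Column-based forms have the advantage that they do not need the list to be split into head and tail.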

Solution 3

Another option that I've just learnt.

import org.apache.spark.sql.functions.col
val columns = Seq[String]("col1", "col2", "col3")
val colNames = columns.map(name => col(name))
// Bind the result to a new name; `val df = df.select(...)` would be a recursive definition
val selected = df.select(colNames: _*)

Solution 4

First, convert the String array to a List of Spark Column objects, as below:

String[] strColNameArray = new String[]{"a", "b", "c", "d"};

List<Column> colNames = new ArrayList<>();

for(String strColName : strColNameArray){
    colNames.add(new Column(strColName));
}

Then convert the List with JavaConversions inside the select call, as below. You need the following import statement. Note that JavaConversions is deprecated since Scala 2.12 and removed in Scala 2.13, so this applies to Spark builds on older Scala versions.

import scala.collection.JavaConversions;

Dataset<Row> selectedDF = df.select(JavaConversions.asScalaBuffer(colNames));

Solution 5

Yes, you can use .select in Scala.

Use .head and .tail to pass every column name in the List to select: the first name as the required argument and the rest as varargs.

Example

val cols = List("b", "c")
df.select(cols.head, cols.tail: _*)
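
One thing to watch with the head/tail form: cols.head throws a NoSuchElementException on an empty list. A minimal sketch of guarding against that with pattern matching follows; split is a hypothetical helper, not a Spark API.

```scala
object HeadTailDemo {
  // Splits a list of names into the (first, rest) pair that
  // select(col: String, cols: String*) expects; None if the list is empty.
  def split(names: List[String]): Option[(String, List[String])] =
    names match {
      case first :: rest => Some((first, rest))
      case Nil           => None
    }

  def main(args: Array[String]): Unit = {
    println(split(List("b", "c"))) // prints "Some((b,List(c)))"
    println(split(Nil))            // prints "None"
  }
}
```

With a DataFrame in scope, the result could be used as split(cols).map { case (first, rest) => df.select(first, rest: _*) }, which yields None instead of failing when cols is empty.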


Author: Ben

Updated on March 27, 2020

Comments

  • Ben
    Ben about 4 years

    I have a Spark data frame df. Is there a way of selecting a subset of its columns using a list of column names?

    scala> df.columns
    res0: Array[String] = Array("a", "b", "c", "d")
    

    I know I can do something like df.select("b", "c"). But suppose I have a list containing a few column names, val cols = List("b", "c"). Is there a way to pass this to df.select? df.select(cols) throws an error. I am looking for something like df.select(*cols) in Python.

  • Ben
    Ben over 8 years
    Thanks! Worked like a charm. Could you explain a bit more about the syntax? Specifically, what does cols.tail: _* do?
  • Ben
    Ben over 8 years
    I think I understand now. The key is the method signature of select: select(col: String, cols: String*). The cols: String* parameter takes a variable number of arguments, and : _* unpacks a sequence so that it can be passed to such a parameter. Very similar to unpacking in Python with *args.
  • Shagun Sodhani
    Shagun Sodhani over 8 years
    Cool! You got it right :) Sorry I got both the notifications just now so couldn't reply earlier. :)
  • Ben
    Ben over 8 years
    No problem. Thanks again!
  • MaxU - stop genocide of UA
    MaxU - stop genocide of UA over 6 years
    What about a bit shorter version: df.select(cols.map(df(_)): _*) ?
  • user1326784
    user1326784 about 4 years
    Can you please share how to do the same (pass the column names) in Java, when calling dataframeResult = inpDataframe.select("col1","col2",....)?
  • Olfa2
    Olfa2 about 2 years
    Can you elaborate, please?