Spark dataframe get column value into a string variable
Solution 1
The col("name")
gives you a column expression. If you want to extract data from column "name" just do the same thing without col("name")
:
val names = test.filter(test("id").equalTo("200"))
.select("name")
.collectAsList() // returns a List[Row]
Then for a row you could get name in String by:
val name = row.getString(0)
Solution 2
val maxDate = spark.sql("select max(export_time) as export_time from tier1_spend.cost_gcp_raw").first()
val rowValue = maxDate.get(0)
Solution 3
By this snippet, you can extract all the values in a column into a string. Modify the snippet with where clauses to get your desired value.
val df = Seq((5, 2), (10, 1)).toDF("A", "B")
val col_val_df = df.select($"A").collect()
val col_val_str = col_val_df.map(x => x.get(0)).mkString(",")
/*
df: org.apache.spark.sql.DataFrame = [A: int, B: int]
col_val_row: Array[org.apache.spark.sql.Row] = Array([5], [10])
col_val_str: String = 5,10
*/
The value of entire column is stored in col_val_str
col_val_str: String = 5,10
Solution 4
For anyone interested below is an way to turn a column into an Array, for the below case we are just taking the first value.
val names= test.filter(test("id").equalTo("200")).selectExpr("name").rdd.map(x=>x.mkString).collect
val name = names(0)
Solution 5
Let us assume you need to pick the name
from the below table for a particular Id
and store that value in a variable.
+-----+-------+
| id | name |
+-----+-------+
| 100 | Alex |
| 200 | Bidan |
| 300 | Cary |
+-----+-------+
SCALA
-----------
Irrelevant data is filtered out first and then the name
column is selected and finally stored into name
variable
var name = df.filter($"id" === "100").select("name").collect().map(_.getString(0)).mkString("")
PYTHON (PYSPARK)
-----------------------------
For simpler usage, I have created a function that returns the value by passing the dataframe and the desired column name to this (this is spark Dataframe and not Pandas Dataframe). Before passing the dataframe to this function, filter
is applied to filter out other records.
def GetValueFromDataframe(_df,columnName):
for row in _df.rdd.collect():
return row[columnName].strip()
name = GetValueFromDataframe(df.filter(df.id == "100"),"name")
There might be more simpler approach than this using 3x
version of Python. The code which I showed above was tested for 2.7
version.
Note :
It is most likely to encounter out of memory error (Driver memory) since we use the collect
function. Hence it is always recommended to apply transformations (like filter
,where
etc) before you call the collect
function. If you
still encounter with driver out of memory issue, you could pass --conf spark.driver.maxResultSize=0
as command line argument to make use of unlimited driver memory.
G G
Updated on February 06, 2021Comments
-
G G over 3 years
I am trying extract column value into a variable so that I can use the value somewhere else in the code. I am trying like the following
val name= test.filter(test("id").equalTo("200")).select("name").col("name")
It returns
name org.apache.spark.sql.Column = name
how to get the value?