first_value windowing function in pyspark
16,393
Spark >= 2.0:
first
takes an optional ignorenulls
argument which can mimic the behavior of first_value
:
df.select(col("k"), first("v", True).over(w).alias("fv"))
Spark < 2.0:
Available function is called first
and can be used as follows:
df = sc.parallelize([
("a", None), ("a", 1), ("a", -1), ("b", 3)
]).toDF(["k", "v"])
w = Window().partitionBy("k").orderBy("v")
df.select(col("k"), first("v").over(w).alias("fv"))
but if you want to ignore nulls you'll have to use Hive UDFs directly:
df.registerTempTable("df")
sqlContext.sql("""
SELECT k, first_value(v, TRUE) OVER (PARTITION BY k ORDER BY v)
FROM df""")
Author by
liber
Updated on August 07, 2022Comments
-
liber over 1 year
I am using pyspark 1.5 getting my data from Hive tables and trying to use windowing functions.
According to this there exists an analytic function called
firstValue
that will give me the first non-null value for a given window. I know this exists in Hive but I can not find this in pyspark anywhere.Is there a way to implement this given that pyspark won't allow UserDefinedAggregateFunctions (UDAFs)?