Hash function in Spark
Solution 1
It is Murmur3, based on the Spark source code:
/**
 * Calculates the hash code of given columns, and returns the result as an int column.
 *
 * @group misc_funcs
 * @since 2.0.0
 */
@scala.annotation.varargs
def hash(cols: Column*): Column = withExpr {
  new Murmur3Hash(cols.map(_.expr))
}
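For reference, a minimal usage sketch (the toy data, the column name value, and a SparkSession named spark, as in spark-shell, are assumptions):

import org.apache.spark.sql.functions.hash

// Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
import spark.implicits._

// Toy DataFrame; "value" is just an illustrative column name.
val df = Seq("a", "b", "c").toDF("value")

// hash() applies Murmur3 and yields a 32-bit IntegerType column.
df.withColumn("hashed", hash($"value")).show()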
Solution 2
If you want a Long hash, Spark 3 provides the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.
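A minimal sketch of xxhash64 (a DataFrame function in org.apache.spark.sql.functions since Spark 3.0), again with a made-up value column and an existing SparkSession named spark; it returns a LongType column:

import org.apache.spark.sql.functions.xxhash64

// Assumes an existing SparkSession named `spark`.
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// xxhash64 produces a 64-bit LongType hash (vs. hash()'s 32-bit Int).
df.withColumn("hashID", xxhash64($"value")).show()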
You may want only non-negative numbers. In that case you can take hash and add Int.MaxValue, as in:

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()

(The cast to LongType keeps the addition from overflowing the 32-bit Int range; the one edge case, a hash equal to Int.MinValue, still yields -1.)
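Put together, a self-contained sketch of that trick (the toy data and the column name value are assumptions), with the imports the one-liner above relies on:

import org.apache.spark.sql.functions.hash
import org.apache.spark.sql.types.LongType

// Assumes an existing SparkSession named `spark`.
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// Cast to Long before adding Int.MaxValue so the sum cannot overflow 32 bits.
df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()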
Comments
- Viacheslav Shalamov (about 2 years ago):
I'm trying to add a column to a dataframe which will contain the hash of another column.
I've found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
And tried this:

import org.apache.spark.sql.functions._
val df = spark.read.parquet(...)
val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
But what is the hash function used by that hash()? Is that murmur, sha, md5, or something else?
The value I get in this column is an integer, so the range of values here is probably [-2^31 ... 2^31 - 1].
Can I get a long value here? Can I get a string hash instead?
How can I specify a concrete hashing algorithm for that?
Can I use a custom hash function?