Hash function in Spark


Solution 1

It is Murmur3, based on the Spark source code:

  /**
   * Calculates the hash code of given columns, and returns the result as an int column.
   *
   * @group misc_funcs
   * @since 2.0.0
   */
  @scala.annotation.varargs
  def hash(cols: Column*): Column = withExpr {
    new Murmur3Hash(cols.map(_.expr))
  }
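For illustration, here is Murmur3 from the Scala standard library. This is only a sketch of the hash family: Spark's `hash()` applies its own Murmur3 implementation (seed 42) to each column's internal binary representation, so the concrete values it produces will differ from these.

```scala
import scala.util.hashing.MurmurHash3

object MurmurDemo {
  def main(args: Array[String]): Unit = {
    // Murmur3 over UTF-8 bytes with seed 42, purely for illustration.
    // Spark hashes the column's internal (Catalyst) byte layout instead,
    // so do not expect these values to match hash($"col") output.
    val h = MurmurHash3.bytesHash("my_value".getBytes("UTF-8"), 42)
    println(h)
  }
}
```

The key property, shared with Spark's `hash()`, is that the result is a deterministic 32-bit `Int` for the same input bytes and seed.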

Solution 2

If you want a Long hash, Spark 3 provides the xxhash64 function: https://spark.apache.org/docs/3.0.0-preview/api/sql/index.html#xxhash64.

If you want to avoid negative numbers, you can cast the hash to a Long and add Int.MaxValue (this requires importing org.apache.spark.sql.types.LongType):

df.withColumn("hashID", hash($"value").cast(LongType) + Int.MaxValue).show()

Note that hash can return Int.MinValue, so the shifted result can still be -1 in the worst case.
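The shift can also be expressed in plain Scala. Since adding Int.MaxValue still allows a worst case of -1 (when the hash equals Int.MinValue), a hypothetical helper that subtracts Int.MinValue instead guarantees a non-negative result:

```scala
object HashShift {
  // hash() returns an Int in [Int.MinValue, Int.MaxValue].
  // Subtracting Int.MinValue (as a Long) maps that range onto
  // [0, 2^32 - 1], so the result is always non-negative.
  def toNonNegative(h: Int): Long = h.toLong - Int.MinValue.toLong

  def main(args: Array[String]): Unit = {
    println(toNonNegative(Int.MinValue)) // 0
    println(toNonNegative(Int.MaxValue)) // 4294967295
  }
}
```

In DataFrame terms the equivalent expression would be `hash($"value").cast(LongType) - Int.MinValue.toLong`.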
Author: Viacheslav Shalamov, Data Engineer with a Computer Science and Machine Learning background.

Updated on June 18, 2022

Comments

  • Viacheslav Shalamov, about 2 years ago

    I'm trying to add a column to a dataframe that will contain the hash of another column.

    I've found this piece of documentation: https://spark.apache.org/docs/2.3.0/api/sql/index.html#hash
    And tried this:

    import org.apache.spark.sql.functions._
    val df = spark.read.parquet(...)
    val withHashedColumn = df.withColumn("hashed", hash($"my_column"))
    

    But which hash function does that hash() use? Is it murmur, sha, md5, or something else?

    The value I get in this column is an integer, so the range of values here is presumably [-2^31 ... 2^31 - 1].
    Can I get a long value here? Can I get a string hash instead?
    How can I specify a concrete hashing algorithm for that?
    Can I use a custom hash function?
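On the string-hash question: Spark also ships md5, sha1, and sha2 column functions that return hex-string digests. In plain JVM terms they correspond to java.security.MessageDigest; a minimal sketch (hexDigest is a hypothetical helper, not a Spark API):

```scala
import java.security.MessageDigest

object StringHashes {
  // Hex-encode a digest of the UTF-8 bytes of a string. Spark's
  // md5/sha1/sha2 column functions produce hex strings of this kind
  // over a column's binary representation.
  def hexDigest(algorithm: String, s: String): String =
    MessageDigest.getInstance(algorithm)
      .digest(s.getBytes("UTF-8"))
      .map("%02x".format(_))
      .mkString

  def main(args: Array[String]): Unit = {
    println(hexDigest("MD5", "abc"))     // 900150983cd24fb0d6963f7d28e17f72
    println(hexDigest("SHA-256", "abc"))
  }
}
```

A custom hash can always be plugged in as a user-defined function (udf) over the column, at the cost of losing Catalyst optimization for that expression.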