Create hash value for each row of data with selected columns in dataframe in python pandas
Solution 1
Convert each row to a tuple and hash it:
df.apply(lambda x: hash(tuple(x)), axis = 1)
As an example:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(3,5))
print(df)
df.apply(lambda x: hash(tuple(x)), axis = 1)
0 1 2 3 4
0 0.728046 0.542013 0.672425 0.374253 0.718211
1 0.875581 0.512513 0.826147 0.748880 0.835621
2 0.451142 0.178005 0.002384 0.060760 0.098650
0 5024405147753823273
1 -798936807792898628
2 -8745618293760919309
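The example above hashes every column. To hash only selected columns, as the question asks, one option is to subset the frame first. This is a minimal sketch: the column names cola and colb are borrowed from the question's draft code, and the data is made up for illustration.

```python
import pandas as pd

# Hypothetical data; only the column names come from the question.
df = pd.DataFrame({
    "cola": [1, 2, 1],
    "colb": [10, 20, 10],
    "colc": [100, 200, 300],
})

# Restrict to the columns of interest, then hash each row as a tuple.
selected = ["cola", "colb"]
row_hashes = df[selected].apply(lambda row: hash(tuple(row)), axis=1)

# Rows 0 and 2 agree on cola/colb, so they get the same hash even though colc differs.
print(row_hashes)
```

Because this relies on the built-in hash(), the values should only be compared within a single Python process.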
Solution 2
This is now built into pandas as pandas.util.hash_pandas_object, which returns a Series with one uint64 hash per row:
pandas.util.hash_pandas_object(df)
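To restrict the hash to selected columns, one approach is to pass a column subset and set index=False so the row labels do not contribute. A short sketch, with made-up data and column names borrowed from the question:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({
    "cola": ["a", "b", "a"],
    "colb": [1, 2, 1],
    "colc": [0.1, 0.2, 0.3],
})

# Hash only cola and colb; index=False keeps the index out of the hash,
# so rows with equal values in those columns get equal hashes.
df["hash"] = pd.util.hash_pandas_object(df[["cola", "colb"]], index=False)

print(df)
```

With the default index=True, the index value is mixed into each row's hash, so otherwise identical rows would hash differently.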
Solution 3
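These solutions rely on Python's built-in hash(), so the values are only guaranteed to be stable for the life of the Python process (string hashes, for example, are salted per process unless PYTHONHASHSEED is set).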
If order matters, one method would be to coerce the row (a Series object) to a tuple:
>>> hash(tuple(df.iloc[1]))
-4901655572611365671
This demonstrates order matters for tuple hashing:
>>> hash((1,2,3))
2528502973977326415
>>> hash((3,2,1))
5050909583595644743
To do this for every row and append the result as a column:
>>> df = df.drop(columns='hash') # lose the old hash
>>> df['hash'] = [hash(tuple(row)) for _, row in df.iterrows()]
>>> df
y x0 hash
0 11.624345 10 -7519341396217622291
1 10.388244 11 -6224388738743104050
2 11.471828 12 -4278475798199948732
3 11.927031 13 -1086800262788974363
4 14.865408 14 4065918964297112768
5 12.698461 15 8870116070367064431
6 17.744812 16 -2001582243795030948
7 16.238793 17 4683560048732242225
8 18.319039 18 -4288960467160144170
9 18.750630 19 7149535252257157079
[10 rows x 3 columns]
If order does not matter, use the hash of frozensets instead of tuples:
>>> hash(frozenset((3,2,1)))
-272375401224217160
>>> hash(frozenset((1,2,3)))
-272375401224217160
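Applied to a DataFrame, the frozenset approach might look like the sketch below (the data is made up). One caveat worth noting: a frozenset also discards duplicates, so rows that differ only in how often a value repeats will collide.

```python
import pandas as pd

# Hypothetical data: rows are (1,2,3), (3,2,1), (1,1,2), (1,2,2).
df = pd.DataFrame({
    "a": [1, 3, 1, 1],
    "b": [2, 2, 1, 2],
    "c": [3, 1, 2, 2],
})

# Order-insensitive hash of each row.
unordered = df.apply(lambda row: hash(frozenset(row)), axis=1)

# Rows 0 and 1 match (same values, different order).
# Rows 2 and 3 also match, because both reduce to the set {1, 2}.
print(unordered)
```

If repeated values need to count, hashing a sorted tuple of the row, hash(tuple(sorted(row))), is one alternative, assuming the values are mutually comparable.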
Avoid summing the hashes of all of the elements in the row, as this could be cryptographically insecure and lead to hashes that fall outside the range of the original.
(You could use modulo to constrain the range, but this amounts to rolling your own hash function, and the best practice is not to.)
You can also make permanent, cryptographic-quality hashes, for example with sha256, using the hashlib module.
There is some discussion of the API for cryptographic hash functions in PEP 452.
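As a rough sketch of the hashlib route, each row of the selected columns can be serialized to bytes and passed to sha256. The serialization used here (str() of each value, joined with a unit-separator character) is an assumption made for illustration, not a standard; any scheme works as long as it is unambiguous and applied consistently.

```python
import hashlib
import pandas as pd

# Hypothetical data; column names borrowed from the question.
df = pd.DataFrame({
    "cola": ["x", "y", "x"],
    "colb": [1, 2, 1],
    "colc": [0.5, 0.6, 0.7],
})

def row_digest(row):
    # Join values with a separator so ("ab", "c") and ("a", "bc") differ.
    # The separator and the use of str() are illustrative choices.
    payload = "\x1f".join(str(value) for value in row)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Hash only the selected columns, one hex digest per row.
df["hash"] = df[["cola", "colb"]].apply(row_digest, axis=1)

print(df)
```

Unlike the built-in hash(), these digests do not change between Python processes, so they can be stored and compared later.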
Thanks to users Jamie Marshal and Discrete Lizard for their comments.
lokheart
Updated on July 09, 2022
Comments
-
lokheart almost 2 years
I have asked a similar question in R about creating a hash value for each row of data. I know that I can use something like
hashlib.md5(b'Hello World').hexdigest()
to hash a string, but how about a row in a dataframe?
Update 01
I have drafted my code as below:
for index, row in course_staff_df.iterrows():
    temp_df.loc[index, 'hash'] = hashlib.md5(str(row[['cola', 'colb']].values).encode('utf-8')).hexdigest()
This does not seem very Pythonic to me. Is there a better solution?
-
Mark Rotteveel about 4 years
Please don't post only code as an answer, but also provide an explanation of what your code does and how it solves the problem of the question. Answers with an explanation are usually more helpful and of better quality, and are more likely to attract upvotes.
-
Russia Must Remove Putin over 3 years
This doesn't answer the question: "Create hash value for each row of data with selected columns in DataFrame in Python Pandas". A row is not semantically a pandas object in the first place. The docs say of the function you gave: "Return a data hash of the Index/Series/DataFrame", and none of these are "rows".
-
Neal Fultz over 3 years
Yeah, the documentation is not great.