Convert unique numbers to md5 hash using pandas

11,106

Solution 1

hashlib.md5 hashes a single value at a time; you can't pass it an array of values as you can with some NumPy/pandas functions. In Python 3 that value must be bytes, so each number has to be converted to a string and encoded first. You could use a list comprehension to build a list of md5sums:

ob['md5'] = [hashlib.md5(str(val).encode('utf-8')).hexdigest() for val in ob['ssno']]
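A self-contained sketch of the list-comprehension approach may help here. The small `ob` frame below is made-up sample data standing in for the asker's frame, and the code assumes Python 3, where hashlib needs bytes rather than str. It also shows why hashing the literal `'ssno'` produced one digest for every row: that call hashes the column name, not the column values.

```python
import hashlib

import pandas as pd

# Hypothetical sample data standing in for the asker's `ob` frame.
ob = pd.DataFrame({
    "lname": ["Banderas", "Garfield", "Pacino"],
    "ssno": [123456789, 234567891, 345678912],
})

# Hashing the literal column name yields a single digest; assigning it to a
# column would broadcast that one value to every row.
same_for_all = hashlib.md5("ssno".encode("utf-8")).hexdigest()

# Hashing each value: convert to str, encode to bytes, then take the digest.
ob["md5"] = [
    hashlib.md5(str(val).encode("utf-8")).hexdigest() for val in ob["ssno"]
]

print(ob)
```

Each distinct number gets its own 32-character hex digest, and identical numbers always map to the same digest.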

Solution 2

If you are hashing with SHA-256 instead, the same applies: convert each value to a string and encode it (typically as UTF-8) before hashing:

ob['sha256'] = [hashlib.sha256(str(val).encode('utf-8')).hexdigest() for val in ob['ssno']]
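If you need both algorithms, the two solutions can be folded into one small helper. This is only a sketch: `hash_column` is a hypothetical name, not a pandas or hashlib function, and the sample frame is invented. It relies on `hashlib.new`, which looks up the algorithm by name, and on `Series.map` to apply the hash per value.

```python
import hashlib

import pandas as pd


def hash_column(series: pd.Series, algorithm: str = "sha256") -> pd.Series:
    """Return the hex digest of each value's UTF-8 string form."""
    return series.map(
        lambda val: hashlib.new(algorithm, str(val).encode("utf-8")).hexdigest()
    )


# Hypothetical sample data; the last two rows share a number on purpose.
ob = pd.DataFrame({"ssno": [123456789, 234567891, 891234567, 891234567]})
ob["sha256"] = hash_column(ob["ssno"])
ob["md5"] = hash_column(ob["ssno"], algorithm="md5")

print(ob)
```

Rows with the same number end up with the same digest, so the hash column can still be used to match records.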
Author by

Dave

I am a Program Analyst. I have been using python and pandas for data analysis since 2011. Recently I started developing dashboards with dash and plotly. One day, I'd like to contribute to the cyber world by providing simple dynamic visuals for just about everything. I am extremely grateful for the stackoverflow community.

Updated on July 30, 2022

Comments

  • Dave
    Dave almost 2 years

    Good morning, All.

    I want to convert my social security numbers to MD5 hex digests. The outcome should be a unique MD5 hex digest for each social security number.

    My data format is as follows:

    ob = onboard[['regions','lname','ssno']][:10]
    ob
    
       regions               lname     ssno
    0  Northern Region (R1)  Banderas  123456789
    1  Northern Region (R1)  Garfield  234567891
    2  Northern Region (R1)  Pacino    345678912
    3  Northern Region (R1)  Baldwin   456789123
    4  Northern Region (R1)  Brody     567891234
    5  Northern Region (R1)  Johnson   6789123456
    6  Northern Region (R1)  Guinness  7890123456
    7  Northern Region (R1)  Hopkins   891234567
    8  Northern Region (R1)  Paul      891234567
    9  Northern Region (R1)  Arkin     987654321
    

    I've tried the following code using hashlib:

    import hashlib
    
    ob['md5'] = hashlib.md5(['ssno'])
    

    This gave me the error that it had to be a string not a list. So I tried the following:

    ob['md5'] = hashlib.md5('ssno').hexdigest()
    
    
    
       regions               lname     ssno       md5
    0  Northern Region (R1)  Banderas  123456789  a1b3ec3d8a026d392ad551701ad7881e
    1  Northern Region (R1)  Garfield  234567891  a1b3ec3d8a026d392ad551701ad7881e
    2  Northern Region (R1)  Pacino    345678912  a1b3ec3d8a026d392ad551701ad7881e
    3  Northern Region (R1)  Baldwin   456789123  a1b3ec3d8a026d392ad551701ad7881e
    4  Northern Region (R1)  Brody     567891234  a1b3ec3d8a026d392ad551701ad7881e
    5  Northern Region (R1)  Johnson   678912345  a1b3ec3d8a026d392ad551701ad7881e
    6  Northern Region (R1)  Johnson   789123456  a1b3ec3d8a026d392ad551701ad7881e
    7  Northern Region (R1)  Guiness   891234567  a1b3ec3d8a026d392ad551701ad7881e
    8  Northern Region (R1)  Hopkins   912345678  a1b3ec3d8a026d392ad551701ad7881e
    9  Northern Region (R1)  Paul      159753456  a1b3ec3d8a026d392ad551701ad7881e
    

    This was very close to what I need, but every row got the same hex digest regardless of whether the social security numbers differed. I am trying to get a distinct hex digest for each distinct social security number.

    Any suggestions?

  • Dave
    Dave about 9 years
    Absolutely, Beautiful! Makes sense. Thanks for educating me and assisting with a solution! Exactly what I needed!
  • rocksteady
    rocksteady over 5 years
    For anyone hitting the 'object supporting the buffer API required' error on this, it can be caused by null (NaN) values in your pandas Series, which may need to be processed or removed before hashing.
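One way to handle the missing values mentioned in the last comment is to skip them while hashing. The sketch below uses made-up data and a hypothetical helper named `md5_or_none`; it leaves missing entries missing and hashes everything else.

```python
import hashlib

import pandas as pd

# Hypothetical data with one missing social security number.
ssno = pd.Series(["123456789", None, "345678912"])


def md5_or_none(val):
    """Hash a single value, passing missing values through unhashed."""
    if pd.isna(val):
        return None
    return hashlib.md5(str(val).encode("utf-8")).hexdigest()


hashed = ssno.map(md5_or_none)
print(hashed)
```

If rows without a number are not needed at all, calling `ssno.dropna()` before hashing is a simpler alternative.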