Pandas - Generate Unique ID based on row values

python pandas hash

11,925

Solution 1

You can try using Python's built-in hash function on the concatenated name columns:

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

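Please note that hash() returns a signed integer that can be up to 19 digits long, not a fixed 10-digit value, and that uniqueness is likely but not guaranteed. Python also salts string hashes per process unless PYTHONHASHSEED is set, so these IDs will differ from one run to the next.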
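If the IDs need to be the same on every run and always 10 digits long, one option is a cryptographic digest folded into the 10-digit range. The following is only a minimal sketch, not the answerer's method: the helper name stable_id and its digits parameter are made up for illustration, and the "|" separator is an assumption chosen so that ("ab", "c") and ("a", "bc") do not produce the same key.

```python
import hashlib

import pandas as pd


def stable_id(*values, digits=10):
    """Hypothetical helper: map row values to a fixed-width integer ID."""
    # Join the values with a separator so field boundaries stay distinct.
    key = "|".join(str(v) for v in values).encode("utf-8")
    # Use the first 8 bytes of the SHA-256 digest as an unsigned integer.
    digest = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    low = 10 ** (digits - 1)
    # Fold into [low, 10 * low) so the result always has exactly `digits` digits.
    return low + digest % (9 * low)


# Sample data taken from the question.
df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})

df["id"] = [
    stable_id(f, l, d) for f, l, d in zip(df["first"], df["last"], df["dob"])
]
print(df)
```

Identical rows get identical IDs regardless of row order. Collisions are still possible: with about 9 billion possible values and 50k distinct rows, the birthday bound gives roughly a 13% chance of at least one clash, so it is worth checking that the id column is unique across distinct rows before relying on it.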

Solution 2

Here's a way of doing it using numpy:

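import numpy as np
np.random.seed(1)  # fix the seed so the same ids are drawn on every run

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, axis=1).unique().tolist()

# generate one random 10-digit id per name; integer bounds and int64 are used
# because 10**10 does not fit in 32 bits (random draws are not guaranteed unique)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)

# map names to ids
maps = dict(zip(names, ids))

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, axis=1).map(maps)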

   index  first   last       dob          id
0      0  peter  jones  20000101  9176146523
1      1   john    doe  19870105  8292931172
2      2   adam  smith  19441212  4108641136
3      3   john    doe  19870105  8292931172
4      4  jenny   fast  19640822  6385979058
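Independent random draws can repeat, so two different names could end up with the same ID. A variant that samples without replacement avoids this. This is a sketch using the standard library's random.sample rather than the numpy call above; the variable names are illustrative only.

```python
import random

import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({
    "first": ["peter", "john", "adam", "john", "jenny"],
    "last": ["jones", "doe", "smith", "doe", "fast"],
    "dob": [20000101, 19870105, 19441212, 19870105, 19640822],
})

# One "first last" key per row, plus the distinct keys in order of appearance.
keys = df[["first", "last"]].agg(" ".join, axis=1)
unique_keys = keys.unique().tolist()

# Sample distinct 10-digit integers; sampling from a range never repeats a value.
rng = random.Random(1)
ids = rng.sample(range(10**9, 10**10), k=len(unique_keys))

# Map every row's key to its id.
df["id"] = keys.map(dict(zip(unique_keys, ids)))
print(df)
```

Note that with this approach the ID a name receives depends on the seed and on the order in which names first appear, so the name-to-ID mapping would need to be stored if the rows can be reordered or extended later.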

Author by

swifty

Updated on August 05, 2022

Comments

swifty almost 2 years
I would like to generate an integer-based unique ID for users (in my df).

Let's say I have:
```
index  first  last    dob
0      peter  jones   20000101
1      john   doe     19870105
2      adam   smith   19441212
3      john   doe     19870105
4      jenny  fast    19640822
```
I would like to generate an ID column like so:
```
index  first  last    dob       id
0      peter  jones   20000101  1244821450
1      john   doe     19870105  1742118427
2      adam   smith   19441212  1841181386
3      john   doe     19870105  1742118427
4      jenny  fast    19640822  1687411973
```
The ID should be 10 digits long and based on the values of the fields, so identical rows (such as the two john doe rows) get the same ID.

I've looked into hashing, encryption, and UUIDs but can't find much related to this specific non-security use case. It's just about generating an internal identifier.
- I can't use groupby/cat code type methods in case the order of the rows changes.
- The dataset won't grow beyond 50k rows.
- Safe to assume there won't be a first, last, dob duplicate.
Feel like I may be tackling this the wrong way as I can't find much literature on it!

Thanks
- Jon Clements over 4 years
  
  Does something like: df.groupby(['first', 'last', 'dob'], sort=False).ngroup().apply('{:010}'.format) do what you want?
- Mahendra Singh over 4 years
  
  You can follow this thread to learn more about hashing stackoverflow.com/questions/16008670/…
swifty over 4 years

This is pretty nice, though I'm getting some 9-digit IDs mixed in.
RockStar over 4 years

Can you share a couple of strings for which 9 digits were generated?
Umar.H over 4 years

Would you need to use a seed to make the generation consistent?
swifty over 4 years

Sarah Wood, Tom Almond
RockStar over 4 years

I have tested in multiple environments and it generates 10 digits only. Check this link - onlinegdb.com/ByUhl5z48
RockStar over 4 years

@swifty Add some code that we can run, test, and modify.
swifty over 4 years

This is bad code but it should demonstrate the problem - onlinegdb.com/rJ6o_qGNU
RockStar over 4 years

@swifty I tested your code with the updated function in my answer and it works properly. Check - onlinegdb.com/B1tucqfN8
RockStar over 4 years

@swifty Did it help?