Pandas - Generate Unique ID based on row values


Solution 1

You can try using Python's built-in hash function.

df['id'] = df[['first', 'last']].sum(axis=1).map(hash)

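Please note that the hash id is a large integer, possibly negative, that usually has more than 10 digits. It is not guaranteed to be unique, since two different names can collide. For strings, Python also randomizes hash per process by default, so the ids change between runs unless PYTHONHASHSEED is set.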
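If the ids need to be reproducible across runs and exactly 10 digits long, one option is to derive them from a cryptographic digest instead of hash. The following is only a sketch: the helper name `stable_id`, the unit-separator character used to join fields, and the choice of SHA-256 are assumptions for illustration, not part of the original answer.

```python
import hashlib

def stable_id(*fields, digits=10):
    """Map field values to a deterministic integer with exactly `digits` digits."""
    # Join with a separator unlikely to appear in the data, so that
    # ("ab", "c") and ("a", "bc") do not produce the same key.
    key = "\x1f".join(str(f) for f in fields).encode("utf-8")
    # SHA-256 gives the same digest in every process and on every machine.
    n = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    low = 10 ** (digits - 1)   # smallest number with `digits` digits
    span = 9 * low             # how many numbers have exactly `digits` digits
    return low + n % span

# Hypothetical sample rows mirroring the question's data.
rows = [("peter", "jones"), ("john", "doe"), ("adam", "smith"), ("john", "doe")]
ids = [stable_id(first, last) for first, last in rows]
```

On a DataFrame this could be applied as `df['id'] = [stable_id(f, l) for f, l in zip(df['first'], df['last'])]`. Squeezing a digest into 10 digits leaves only 9 billion possible values, so by the birthday bound 50k distinct keys have roughly a one-in-eight chance of at least one collision. It is worth checking that the number of distinct ids equals the number of distinct keys before relying on them.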

Solution 2

Here's a way of doing it using NumPy:

import numpy as np
np.random.seed(1)

# create a list of unique names
names = df[['first', 'last']].agg(' '.join, 1).unique().tolist()

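# generate 10-digit ids (int64 so the upper bound fits on every platform;
# random draws are not guaranteed to be distinct)
ids = np.random.randint(low=10**9, high=10**10, size=len(names), dtype=np.int64)

# map names to ids
maps = dict(zip(names, ids))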

# add new id column
df['id'] = df[['first', 'last']].agg(' '.join, 1).map(maps)

   index  first   last       dob          id
0      0  peter  jones  20000101  9176146523
1      1   john    doe  19870105  8292931172
2      2   adam  smith  19441212  4108641136
3      3   john    doe  19870105  8292931172
4      4  jenny   fast  19640822  6385979058
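Independent random draws can occasionally repeat, so two different names could end up with the same id. A minimal sketch of one way around that, assuming the standard-library `random.sample` is acceptable in place of NumPy, is to sample without replacement from the 10-digit range. The variable names and sample data below are illustrative only.

```python
import random

# Hypothetical "first last" keys, in row order, mirroring the example data.
names = ["peter jones", "john doe", "adam smith", "john doe", "jenny fast"]

# Distinct names in first-seen order.
unique_names = list(dict.fromkeys(names))

# Sampling from a range draws without replacement, so every id is distinct.
rng = random.Random(1)
ids = rng.sample(range(10**9, 10**10), k=len(unique_names))

# Map each name to its id, then look the id up for every row.
id_map = dict(zip(unique_names, ids))
row_ids = [id_map[name] for name in names]
```

The ids here depend on the seed and on the order in which names are first seen, so the mapping would need to be stored if the same person must keep the same id after rows are reordered or added.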
Author by

swifty

Updated on August 05, 2022

Comments

  • swifty
    swifty almost 2 years

    I would like to generate an integer-based unique ID for users (in my df).

    Let's say I have:

    index  first  last    dob
    0      peter  jones   20000101
    1      john   doe     19870105
    2      adam   smith   19441212
    3      john   doe     19870105
    4      jenny  fast    19640822
    

    I would like to generate an ID column like so:

    index  first  last    dob       id
    0      peter  jones   20000101  1244821450
    1      john   doe     19870105  1742118427
    2      adam   smith   19441212  1841181386
    3      john   doe     19870105  1742118427
    4      jenny  fast    19640822  1687411973
    

    10 digit ID, but it's based on the value of the fields (john doe identical row values get the same ID).

    I've looked into hashing, encrypting, and UUIDs but can't find much related to this specific non-security use case. It's just about generating an internal identifier.

    • I can't use groupby/cat code type methods in case the order of the rows changes.
    • The dataset won't grow beyond 50k rows.
    • Safe to assume there won't be a first, last, dob duplicate.

    Feel like I may be tackling this the wrong way as I can't find much literature on it!

    Thanks

    • Jon Clements
      Jon Clements over 4 years
      Does something like: df.groupby(['first', 'last', 'dob'], sort=False).ngroup().apply('{:010}'.format) do what you want?
    • Mahendra Singh
      Mahendra Singh over 4 years
      You can follow this thread to learn more about hashing stackoverflow.com/questions/16008670/…
  • swifty
    swifty over 4 years
    This is pretty nice, though I'm getting some 9-digit IDs mixed in
  • RockStar
    RockStar over 4 years
    Can you share a couple of strings for which 9 digits were generated?
  • Umar.H
    Umar.H over 4 years
    Would you need to use a seed to make the generation consistent?
  • swifty
    swifty over 4 years
    Sarah Wood, Tom Almond
  • RockStar
    RockStar over 4 years
    I have tested on multiple environments, and it generates 10 digits only. Check this link - onlinegdb.com/ByUhl5z48
  • RockStar
    RockStar over 4 years
    @swifty Add some code that you can use, test out, and modify.
  • swifty
    swifty over 4 years
    This is bad code but should demonstrate it - onlinegdb.com/rJ6o_qGNU
  • RockStar
    RockStar over 4 years
    @swifty I tested your code with my updated function in the answer, and it works properly. Check - onlinegdb.com/B1tucqfN8
  • RockStar
    RockStar over 4 years
    @swifty Did it help?