Using Python Faker to generate different data for 5000 rows

Solution 1

I moved the fake stuff list inside my for loop, so that new values are generated on every iteration:

for i in range(10):
    # Rebuild the list on each pass so every row gets fresh values
    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000, 2000)]
    df.loc[i] = stuff
# Print once, after all rows have been added
print(df)
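
The reason this works is that a list is evaluated once, at the point where it is written: if it is built before the loop, each generator call runs a single time and every row receives the same values. The sketch below illustrates the difference using only the standard library; make_row is a hypothetical stand-in for the fake.name(), fake.email(), ... calls, not part of Faker.

```python
import random

def make_row():
    # Hypothetical stand-in for the Faker calls in the question
    return [random.randint(1000, 2000), random.random()]

# Built once, outside the loop: the generator calls run a single time,
# so every row is a copy of the same values
stuff = make_row()
repeated = [list(stuff) for _ in range(5)]

# Built inside the loop: the calls run again on each iteration
fresh = [make_row() for _ in range(5)]

print(repeated)
print(fresh)
```

Every entry in repeated is identical, while the entries in fresh differ from one another.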

Solution 2

Disclaimer: this answer was added long after the question was asked and adds new information that does not directly answer it.

There is now a fast new library: Mimesis, a fake data generator.

  • Upside: it is reported to run many times faster than Faker (see my test below, on data similar to that in the question).
  • Downside: it requires Python 3.6 or later.

pip install mimesis

>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')

>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'

The same with the older Faker library:

pip install faker

>>> from faker import Faker
>>> fake_ru = Faker('ru_RU')
>>> fake_jp = Faker('ja_JP')
>>> print(fake_ru.name())
Субботина Елена Наумовна
>>> print(fake_jp.name())
大垣 花子

Below is my recent timing of Mimesis vs. Faker, based on the code in forzer0eight's answer:

from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows_faker(num=1):
    # Note: "name" appears twice. The later value wins, but both calls run and are timed.
    output = [{"name":fake.name(),
               "address":fake.address(),
               "name":fake.name(),
               "email":fake.email(),
               #"bs":fake.bs(),
               "city":fake.city(),
               "state":fake.state(),
               "date_time":fake.date_time(),
               #"paragraph":fake.paragraph(),
               #"Conrad":fake.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for _ in range(num)]
    return output

%%time
df_faker = pd.DataFrame(create_rows_faker(5000))

CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s

from mimesis import Person
from mimesis import Address
from mimesis import Datetime
from mimesis.enums import Gender
import pandas as pd
import random

person = Person()
address = Address()
datetime = Datetime()

def create_rows_mimesis(num=1):
    # Note: "name" appears twice. The later value (person.name(), a first name) wins,
    # but both calls run and are timed.
    output = [{"name":person.full_name(gender=Gender.FEMALE),
               "address":address.address(),
               "name":person.name(),
               "email":person.email(),
               #"bs":person.bs(),
               "city":address.city(),
               "state":address.state(),
               "date_time":datetime.datetime(),
               #"paragraph":person.paragraph(),
               #"Conrad":person.catch_phrase(),
               "randomdata":random.randint(1000,2000)} for _ in range(num)]
    return output

%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))

CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
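
Note that %%time is a Jupyter/IPython cell magic and measures a single run. Outside a notebook, or to compare both generators over several runs, something like the following sketch could be used. The names time_generator and create_rows_stub are hypothetical and only illustrate the idea; pass create_rows_faker or create_rows_mimesis in place of the stub.

```python
import random
import time

def time_generator(make_rows, num_rows, repeats=3):
    """Return the best wall-clock time in seconds over several runs."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        make_rows(num_rows)
        timings.append(time.perf_counter() - start)
    return min(timings)

def create_rows_stub(num=1):
    # Hypothetical stand-in so the sketch runs without Faker or Mimesis installed
    return [{"randomdata": random.randint(1000, 2000)} for _ in range(num)]

best = time_generator(create_rows_stub, 5000)
print(f"best of 3: {best:.4f} s")
```

Taking the best of several runs reduces the influence of one-off costs such as imports and warm-up, which a single timed cell includes.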

Below is resulting data for comparison:

df_faker.head(2)
   address                                      city         date_time            email              name            randomdata  state
0  3818 Goodwin Haven\nBrocktown, GA 06168      Valdezport   2004-10-18 20:35:52  [email protected]  Deborah Garcia  1218        Oklahoma
1  2568 Gonzales Field\nRichardhaven, NC 79149  West Rachel  1985-02-03 00:33:00  [email protected]  Barbara Pineda  1536        Tennessee

df_mimesis.head(2)
   address             city         date_time                   email              name      randomdata  state
0  351 Nobles Viaduct  Cedar Falls  2013-08-22 08:20:25.288883  [email protected]  Ernest    1673        Georgia
1  517 Williams Hill   Malden       2008-01-26 18:12:01.654995  [email protected]  Jonathan  1845        North Dakota

Solution 3

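The following script markedly improves performance: it builds all rows as a list of dicts first and creates the pandas DataFrame in a single call, instead of adding rows one at a time.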

    from faker import Faker
    import pandas as pd
    import random
    fake = Faker()
    def create_rows(num=1):
        # Note: "name" and "address" each appear twice. The later value wins,
        # but every call still runs and is included in the timing below.
        output = [{"name":fake.name(),
                   "address":fake.address(),
                   "name":fake.name(),
                   "email":fake.email(),
                   "bs":fake.bs(),
                   "address":fake.address(),
                   "city":fake.city(),
                   "state":fake.state(),
                   "date_time":fake.date_time(),
                   "paragraph":fake.paragraph(),
                   "Conrad":fake.catch_phrase(),
                   "randomdata":random.randint(1000,2000)} for _ in range(num)]
        return output

Creating 5000 rows takes 5.55 s:

    %%time
    df = pd.DataFrame(create_rows(5000))

    Wall time: 5.55 s
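
A likely reason for the speed-up is that df.loc[i] = ... enlarges the DataFrame one row at a time, whereas pd.DataFrame(list_of_dicts) builds it in a single call. The sketch below shows the two patterns side by side; make_row is a hypothetical stand-in for the Faker calls, and the column names are chosen only for illustration.

```python
import random
import pandas as pd

def make_row():
    # Hypothetical stand-in for the Faker calls above
    return {"randomdata": random.randint(1000, 2000), "flag": random.choice("ab")}

def build_row_by_row(num):
    # Pattern from the question: grow the DataFrame one row per iteration
    df = pd.DataFrame(columns=["randomdata", "flag"])
    for i in range(num):
        row = make_row()
        df.loc[i] = [row["randomdata"], row["flag"]]
    return df

def build_in_one_call(num):
    # Pattern from this answer: collect the rows first, then construct once
    return pd.DataFrame([make_row() for _ in range(num)])

slow = build_row_by_row(20)
fast = build_in_one_call(20)
print(slow.shape, fast.shape)
```

Both functions produce a DataFrame of the same shape; only the construction strategy differs.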

Solution 4

Combining the farsante and mimesis libraries is an easy way to create pandas DataFrames with fake data.

import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()
def rand_int(min_int, max_int):
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int
df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)

print(df)
        full_name              address    name  ...          state                   datetime some_rand_int
0   Weldon Durham   1027 Nellie Square   Bruna  ...  West Virginia 2030-06-10 09:21:29.179412          1453
1     Veta Conrad  932 Cragmont Arcade  Betsey  ...           Iowa 2017-08-11 23:50:27.479281          1909
2     Vena Kinney    355 Edgar Highway   Tyson  ...  New Hampshire 2002-12-21 05:26:45.723531          1735
3   Adam Sheppard    270 Williar Court  Treena  ...   North Dakota 2011-03-30 19:16:29.015598          1503
4  Penney Allison     592 Oakdale Road    Chas  ...          Maine 2009-12-14 16:31:37.714933          1175

This approach keeps your code clean.

Author by

Conrad Addo

Updated on June 04, 2022

Comments

  • Conrad Addo
    Conrad Addo over 1 year

    I would like to use the Python Faker library to generate 500 lines of data, but I get repeated data with the code I came up with below. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:

    from faker import Factory
    import pandas as pd
    import random
    
    def create_fake_stuff(fake):
        df = pd.DataFrame(columns=('name'
            , 'email'
            , 'bs'
            , 'address'
            , 'city'
            , 'state'
            , 'date_time'
            , 'paragraph'
            , 'Conrad'
            ,'randomdata'))

        stuff = [fake.name()
            , fake.email()
            , fake.bs()
            , fake.address()
            , fake.city()
            , fake.state()
            , fake.date_time()
            , fake.paragraph()
            , fake.catch_phrase()
            , random.randint(1000,2000)]

        for i in range(10):
            df.loc[i] = [item for item in stuff]
        print(df)
    
    if __name__ == '__main__':
        fake = Factory.create()
        create_fake_stuff(fake)
    
  • Alexei Martianov
    Alexei Martianov almost 4 years
    Good point. If you care about speed, I've added an answer using another library that is many times faster.
  • Eugene K
    Eugene K over 3 years
    To be honest, your benchmark is slightly off. If you run both in a loop, you will get very close numbers (in my case, 7 and 8 sec).