Using Python Faker to generate different data for 5000 rows
Solution 1
I placed the fake stuff array inside my for loop to achieve the desired result:
for i in range(10):
    # Build the list inside the loop so every row gets fresh values
    stuff = [fake.name(),
             fake.email(),
             fake.bs(),
             fake.address(),
             fake.city(),
             fake.state(),
             fake.date_time(),
             fake.paragraph(),
             fake.catch_phrase(),
             random.randint(1000, 2000)]
    df.loc[i] = [item for item in stuff]
print(df)
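The reason the original code repeated data is that a list is evaluated once, at the point where it is written, and not each time it is read. The following minimal sketch illustrates this with the standard library only; `make_row` is a hypothetical stand-in for the Faker calls, not part of Faker itself.

```python
import random


def make_row(rng):
    # Hypothetical stand-in for the Faker calls: each call draws fresh values.
    return [rng.randint(1000, 2000), rng.random()]


rng = random.Random(0)

# Built once, outside the loop: every row reuses the same values.
stuff = make_row(rng)
repeated = [list(stuff) for _ in range(10)]

# Built inside the loop: the generator runs again for every row.
fresh = [make_row(rng) for _ in range(10)]

print(len({tuple(row) for row in repeated}))  # number of distinct rows when built once
print(len({tuple(row) for row in fresh}))     # number of distinct rows when built per iteration
```

The first count is 1 because all ten rows are copies of one list; the second is larger because the values are regenerated on each pass.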
Solution 2
Disclaimer: this answer was added long after the question and adds new information that does not directly answer it.
There is now a fast new library, Mimesis - Fake Data Generator.
- Upside: it is stated to be many times faster than faker (see my test below on data similar to that in the question).
- Downside: it works only with Python 3.6 and later.
pip install mimesis
>>> from mimesis import Person
>>> from mimesis.enums import Gender
>>> person = Person('en')
>>> person.full_name(gender=Gender.FEMALE)
'Antonetta Garrison'
>>> personru = Person('ru')
>>> personru.full_name()
'Рената Черкасова'
The same with the earlier-developed faker:
pip install faker
>>> from faker import Faker
>>> fake_ru = Faker('ru_RU')
>>> fake_jp = Faker('ja_JP')
>>> print(fake_ru.name())
Субботина Елена Наумовна
>>> print (fake_jp.name())
大垣 花子
Below is my recent timing of Mimesis vs. Faker, based on the code provided in the answer from forzer0eight:
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows_faker(num=1):
    # Build every row as a dict; the DataFrame is created once from the list
    output = [{"name": fake.name(),
               "address": fake.address(),
               "name": fake.name(),  # duplicate key: overwrites the first "name"
               "email": fake.email(),
               # "bs": fake.bs(),
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               # "paragraph": fake.paragraph(),
               # "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_faker = pd.DataFrame(create_rows_faker(5000))
CPU times: user 3.51 s, sys: 2.86 ms, total: 3.51 s
Wall time: 3.51 s
from mimesis import Person
from mimesis import Address
from mimesis import Datetime
from mimesis.enums import Gender
import pandas as pd
import random

person = Person('en')
address = Address()
datetime = Datetime()

def create_rows_mimesis(num=1):
    # Same row layout as create_rows_faker, using Mimesis providers
    output = [{"name": person.full_name(gender=Gender.FEMALE),
               "address": address.address(),
               "name": person.name(),  # duplicate key: overwrites the full name above
               "email": person.email(),
               # "bs": person.bs(),
               "city": address.city(),
               "state": address.state(),
               "date_time": datetime.datetime(),
               # "paragraph": person.paragraph(),
               # "Conrad": person.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
%%time
df_mimesis = pd.DataFrame(create_rows_mimesis(5000))
CPU times: user 178 ms, sys: 1.7 ms, total: 180 ms
Wall time: 179 ms
Below is the resulting data for comparison:
df_faker.head(2)
   address                                      city         date_time            email              name            randomdata  state
0  3818 Goodwin Haven\nBrocktown, GA 06168      Valdezport   2004-10-18 20:35:52  [email protected]  Deborah Garcia  1218        Oklahoma
1  2568 Gonzales Field\nRichardhaven, NC 79149  West Rachel  1985-02-03 00:33:00  [email protected]  Barbara Pineda  1536        Tennessee
df_mimesis.head(2)
   address             city         date_time                   email              name      randomdata  state
0  351 Nobles Viaduct  Cedar Falls  2013-08-22 08:20:25.288883  [email protected]  Ernest    1673        Georgia
1  517 Williams Hill   Malden       2008-01-26 18:12:01.654995  [email protected]  Jonathan  1845        North Dakota
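A single `%%time` run can be noisy, so the measurement is more trustworthy when each function is timed several times under the same conditions. Below is a minimal sketch of such a harness using only the standard library. `best_time` and `create_rows_standin` are hypothetical names introduced for illustration; in practice you would pass `create_rows_faker` or `create_rows_mimesis` instead of the stand-in.

```python
import random
import time


def best_time(make_rows, num, repeat=3):
    """Return the fastest wall-clock time in seconds over `repeat` runs."""
    timings = []
    for _ in range(repeat):
        start = time.perf_counter()
        make_rows(num)
        timings.append(time.perf_counter() - start)
    return min(timings)


def create_rows_standin(num=1):
    # Hypothetical stand-in for create_rows_faker / create_rows_mimesis.
    return [{"randomdata": random.randint(1000, 2000)} for _ in range(num)]


elapsed = best_time(create_rows_standin, 5000)
print(f"best of 3 runs: {elapsed:.4f} s")
```

Taking the minimum over several runs reduces the influence of one-off delays such as import or cache warm-up costs.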
Solution 3
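The following script can markedly improve performance: it builds all the rows as a list of dictionaries and creates the DataFrame in one call instead of appending rows one at a time.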
from faker import Faker
import pandas as pd
import random

fake = Faker()

def create_rows(num=1):
    # One dict per row; the DataFrame is created once from the whole list
    output = [{"name": fake.name(),
               "address": fake.address(),
               "name": fake.name(),  # duplicate key: overwrites the first "name"
               "email": fake.email(),
               "bs": fake.bs(),
               "address": fake.address(),  # duplicate key: overwrites the first "address"
               "city": fake.city(),
               "state": fake.state(),
               "date_time": fake.date_time(),
               "paragraph": fake.paragraph(),
               "Conrad": fake.catch_phrase(),
               "randomdata": random.randint(1000, 2000)} for x in range(num)]
    return output
Generating 5000 rows this way takes 5.55 s.
%%time
df = pd.DataFrame(create_rows(5000))
Wall time: 5.55 s
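Note that the dictionary literal in `create_rows` repeats the "name" and "address" keys. In Python, every value expression in a dict literal is evaluated, and a later duplicate key silently overwrites the earlier one, so the extra generator calls cost time without adding a column. A small sketch of that behavior follows; `fake_name` is a hypothetical stand-in for `fake.name()` that records how often it is called.

```python
calls = []


def fake_name():
    # Hypothetical stand-in for fake.name(); records each call.
    calls.append("name")
    return f"name-{len(calls)}"


# Same shape as the literal in create_rows: "name" appears twice.
row = {"name": fake_name(), "email": "x", "name": fake_name()}

print(row)         # {'name': 'name-2', 'email': 'x'}
print(len(calls))  # 2
```

Dropping the repeated keys should leave the resulting DataFrame with the same columns while making fewer generator calls per row.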
Solution 4
The farsante and mimesis libraries offer a concise way to create pandas DataFrames with fake data.
import random
import farsante
from mimesis import Person
from mimesis import Address
from mimesis import Datetime

person = Person()
address = Address()
datetime = Datetime()

def rand_int(min_int, max_int):
    # Return a zero-argument callable, as the generator list expects
    def some_rand_int():
        return random.randint(min_int, max_int)
    return some_rand_int

df = farsante.pandas_df([
    person.full_name,
    address.address,
    person.name,
    person.email,
    address.city,
    address.state,
    datetime.datetime,
    rand_int(1000, 2000)], 5)
print(df)
full_name address name ... state datetime some_rand_int
0 Weldon Durham 1027 Nellie Square Bruna ... West Virginia 2030-06-10 09:21:29.179412 1453
1 Veta Conrad 932 Cragmont Arcade Betsey ... Iowa 2017-08-11 23:50:27.479281 1909
2 Vena Kinney 355 Edgar Highway Tyson ... New Hampshire 2002-12-21 05:26:45.723531 1735
3 Adam Sheppard 270 Williar Court Treena ... North Dakota 2011-03-30 19:16:29.015598 1503
4 Penney Allison 592 Oakdale Road Chas ... Maine 2009-12-14 16:31:37.714933 1175
This approach keeps your code clean.
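The column headers in the output above match the names of the callables (for example `some_rand_int`), which suggests that the headers are taken from each function's `__name__`. That is an inference from the printed output, not something confirmed here from the farsante documentation. Under that assumption, a closure factory could set a more descriptive name, as in this sketch; the `name` parameter is a hypothetical addition.

```python
import random


def rand_int(min_int, max_int, name="some_rand_int"):
    def some_rand_int():
        return random.randint(min_int, max_int)
    # Hypothetical tweak: rename the closure so that tools reading
    # __name__ would see a descriptive label.
    some_rand_int.__name__ = name
    return some_rand_int


gen = rand_int(1000, 2000, name="randomdata")
value = gen()
print(gen.__name__)  # randomdata
```

The returned callable still takes no arguments, so it can be used anywhere the original `rand_int(1000, 2000)` was used.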
Conrad Addo
Updated on June 04, 2022
Comments
-
Conrad Addo over 1 year
I would like to use the Python Faker library to generate 500 lines of data, but I get repeated data using the code I came up with below. Can you please point out where I'm going wrong? I believe it has something to do with the for loop. Thanks in advance:
from faker import Factory
import pandas as pd
import random

def create_fake_stuff(fake):
    df = pd.DataFrame(columns=('name'
        , 'email'
        , 'bs'
        , 'address'
        , 'city'
        , 'state'
        , 'date_time'
        , 'paragraph'
        , 'Conrad'
        , 'randomdata'))
    stuff = [fake.name()
        , fake.email()
        , fake.bs()
        , fake.address()
        , fake.city()
        , fake.state()
        , fake.date_time()
        , fake.paragraph()
        , fake.catch_phrase()
        , random.randint(1000,2000)]
    for i in range(10):
        df.loc[i] = [item for item in stuff]
    print(df)

if __name__ == '__main__':
    fake = Factory.create()
    create_fake_stuff(fake)
-
Alexei Martianov almost 4 years
Good point. If you care about speed, I've added an answer using another library that is many times faster.
-
Eugene K over 3 years
To be honest, your benchmark is slightly off: if you run both in a loop, you will get very close numbers (in my case, 7 and 8 sec).