Generating random dates within a given range in pandas

Solution 1

np.random.rand + to_timedelta

This addresses Case (1), random dates with a time component. You can do this by generating a random array of timedelta objects and adding them to your start date.

import numpy as np
import pandas as pd

def random_dates(start, end, n, unit='D', seed=None):
    if seed is not None:  # seeding idea from piR's answer; only seed when asked to
        np.random.seed(seed)

    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit=unit) + start

>>> np.random.seed(0)
>>> start = pd.to_datetime('2015-01-01')
>>> end = pd.to_datetime('2018-01-01')
>>> random_dates(start, end, 10)
DatetimeIndex([   '2016-08-25 01:09:42.969600',
                  '2017-02-23 13:30:20.304000',
                  '2016-10-23 05:33:15.033600',
               '2016-08-20 17:41:04.012799999',
               '2016-04-09 17:59:00.815999999',
                  '2016-12-09 13:06:00.748800',
                  '2016-04-25 00:47:45.974400',
                  '2017-09-05 06:35:58.444800',
                  '2017-11-23 03:18:47.347200',
                  '2016-02-25 15:14:53.894400'],
              dtype='datetime64[ns]', freq=None)

This will generate dates with a time component as well.

Sadly, rand has no replace=False option, so if you want unique dates you'll need a two-step process: 1) generate the (non-unique) days component, and 2) generate a unique seconds/milliseconds component, then add the two together.
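A minimal sketch of that two-step idea, assuming whole-second resolution is enough and that n is at most 86,400 (the number of seconds in a day). The name random_dates_unique and the use of np.random.default_rng are my own choices for illustration, not part of the functions above. It also uses (end - start).days without the + 1, so every result falls before end.

```python
import numpy as np
import pandas as pd

def random_dates_unique(start, end, n, seed=None):
    """Hypothetical helper: n distinct timestamps in [start, end)."""
    rng = np.random.default_rng(seed)
    ndays = (end - start).days  # assumption: no + 1, so results stay before `end`
    if n > 86_400:
        raise ValueError("n must not exceed the number of seconds in a day")

    # Step 1: day offsets, which may repeat.
    days = rng.integers(0, ndays, size=n)
    # Step 2: second-of-day offsets drawn without replacement, so all differ.
    seconds = rng.choice(86_400, size=n, replace=False)

    # Distinct times of day guarantee distinct timestamps, even on repeated days.
    return start + pd.to_timedelta(days, unit='D') + pd.to_timedelta(seconds, unit='s')

start = pd.Timestamp('2015-01-01')
end = pd.Timestamp('2018-01-01')
sample = random_dates_unique(start, end, 1000, seed=0)
```

Because uniqueness comes only from the time-of-day component, this caps n at 86,400. Drawing the unique component at millisecond resolution instead would raise that cap.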


np.random.randint + to_timedelta

This addresses Case (2), random dates without a time component. You can modify random_dates above to generate random integers instead of random floats:

def random_dates2(start, end, n, unit='D', seed=None):
    if seed is not None:  # seeding idea from piR's answer; only seed when asked to
        np.random.seed(seed)

    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.randint(0, ndays, n), unit=unit
    )

>>> random_dates2(start, end, 10, seed=0)
DatetimeIndex(['2016-11-15', '2016-07-13', '2017-04-15', '2017-02-02',
               '2017-10-30', '2015-10-05', '2016-08-22', '2017-12-30',
               '2016-08-23', '2015-11-11'],
              dtype='datetime64[ns]', freq=None)

To generate dates with other frequencies, the functions above can be called with a different value for unit. Additionally, you can add a parameter freq and tweak your function call as needed.
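One possible reading of "add a parameter freq", sketched under my own assumptions: build the grid of allowed timestamps with pd.date_range and pick random positions from it. The name random_dates_freq and the use of np.random.default_rng are hypothetical and not part of the functions above.

```python
import numpy as np
import pandas as pd

def random_dates_freq(start, end, n, freq='D', seed=None):
    """Hypothetical helper: sample n timestamps (with repeats) from a freq grid."""
    rng = np.random.default_rng(seed)
    # Every timestamp between start and end that lies on the requested frequency.
    grid = pd.date_range(start, end, freq=freq)
    # Random integer positions into the grid; positions may repeat.
    positions = rng.integers(0, len(grid), size=n)
    return grid[positions]

# Example: ten random month starts between 2015-01-01 and 2018-01-01.
sample = random_dates_freq('2015-01-01', '2018-01-01', 10, freq='MS', seed=0)
```

This materialises the whole grid, so it is only sensible when the grid is reasonably small, which is the same trade-off the date_range-based answers make.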

If you want unique random dates, you can use np.random.choice with replace=False:

def random_dates2_unique(start, end, n, unit='D', seed=None):
    if seed is not None:  # seeding idea from piR's answer; only seed when asked to
        np.random.seed(seed)

    ndays = (end - start).days + 1
    return start + pd.to_timedelta(
        np.random.choice(ndays, n, replace=False), unit=unit
    )

Performance

I'll benchmark only the methods that address Case (1), since Case (2) is a special case that any of these methods can reach by flooring the result to whole days with dt.floor.
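For reference, a small sketch of getting from Case (1) to Case (2) by flooring. The variable names and the use of np.random.default_rng are mine. Note that a DatetimeIndex exposes floor directly, while a Series goes through the .dt accessor.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
start = pd.Timestamp('2015-01-01')
end = pd.Timestamp('2018-01-01')
ndays = (end - start).days

# Case (1): random timestamps with a time component.
with_time = start + pd.to_timedelta(rng.random(5) * ndays, unit='D')

# Case (2): drop the time component by flooring to whole days.
dates_only = with_time.floor('D')

# The same operation on a Series uses the .dt accessor.
as_series = pd.Series(with_time).dt.floor('D')
```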

[Figure: relative runtime of each function versus N, on a log-log scale]

Functions

def cs(start, end, n):
    ndays = (end - start).days + 1
    return pd.to_timedelta(np.random.rand(n) * ndays, unit='D') + start

def akilat90(start, end, n):
    start_u = start.value//10**9
    end_u = end.value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

def piR(start, end, n):
    dr = pd.date_range(start, end, freq='H') # can't get better than this :-(
    return pd.to_datetime(np.sort(np.random.choice(dr, n, replace=False)))

def piR2(start, end, n):
    dr = pd.date_range(start, end, freq='H')
    a = np.arange(len(dr))
    b = np.sort(np.random.permutation(a)[:n])
    return dr[b]

Benchmarking Code

from timeit import timeit

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

res = pd.DataFrame(
       index=['cs', 'akilat90', 'piR', 'piR2'],
       columns=[10, 20, 50, 100, 200, 500, 1000, 2000, 5000],
       dtype=float
)

for f in res.index: 
    for c in res.columns:
        np.random.seed(0)

        start = pd.to_datetime('2015-01-01')
        end = pd.to_datetime('2018-01-01')

        stmt = '{}(start, end, c)'.format(f)
        setp = 'from __main__ import start, end, c, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=30)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")

plt.show()

Solution 2

Just my two cents, using date_range and sample:

def random_dates(start, end, n, seed=1, replace=False):
    dates = pd.date_range(start, end).to_series()
    return dates.sample(n, replace=replace, random_state=seed)

random_dates("20170101","20171223", 10, seed=1)
Out[29]: 
2017-10-01   2017-10-01
2017-08-23   2017-08-23
2017-11-30   2017-11-30
2017-06-15   2017-06-15
2017-11-18   2017-11-18
2017-10-31   2017-10-31
2017-07-31   2017-07-31
2017-03-07   2017-03-07
2017-09-09   2017-09-09
2017-10-15   2017-10-15
dtype: datetime64[ns]
Author by

cs95


Updated on February 13, 2020

Comments

  • cs95 over 4 years

    This is a self-answered post. A common problem is to randomly generate dates between a given start and end date.

    There are two cases to consider:

    1. random dates with a time component, and
    2. random dates without time

    For example, given some start date 2015-01-01 and an end date 2018-01-01, how can I sample N random dates between this range using pandas?

  • akilat90 almost 6 years
    @coldspeed thanks! The constant time seems a bit fishy to me though. I wonder if anyone has an explanation.
  • cs95 almost 6 years
    @akilat90 It's relative time (loglog). "My answer is twice as slow as yours, piR's answer is .5 times as slow as yours"... etc.
  • akilat90 almost 6 years
    Ah! relative. Got it. :)
  • akilat90 almost 6 years
    @coldspeed My second favorite thing about this question is this benchmarking code. Perhaps add that to a tag wiki so that a wider audience can reuse it?
  • cs95 almost 6 years
    @akilat90 I recently discovered something similar that does exactly this. It's called perfplot. Can't believe I was constantly reinventing the wheel when something like this was already out there...
  • Jivan almost 2 years
    this should be the top answer