Why use infer_datetime_format when importing csv file?

The docs for pandas.read_csv suggest why:

infer_datetime_format : boolean, default False

If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

Essentially, Pandas deduces the format of your datetime from the first element(s) and then assumes all other elements in the series will use the same format. This means Pandas does not need to check multiple formats when attempting to convert a string to datetime.
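
The idea can be sketched with the standard library alone. This is only an illustration of "guess the format once, then reuse it", not pandas' actual implementation; the candidate list and the helper names `guess_format` and `parse_all` are made up for the example.

```python
from datetime import datetime

# Hypothetical candidate formats, chosen only for this illustration.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y %H:%M"]


def guess_format(sample):
    """Return the first candidate format that parses `sample`, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            datetime.strptime(sample, fmt)
            return fmt
        except ValueError:
            continue
    return None


def parse_all(values):
    """Guess the format from the first value only, then reuse it for every value."""
    fmt = guess_format(values[0])
    if fmt is None:
        raise ValueError("could not infer a format from the first value")
    # No per-element guessing here: one known format is applied throughout.
    return [datetime.strptime(v, fmt) for v in values]


dates = parse_all(["2018-01-05", "2018-12-20", "2018-03-30"])
```

The trial-and-error loop runs once for the first string, not once per string, which is where the saving comes from.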

Remember, CSV files can only hold textual data, so a conversion to datetime (essentially a numeric type) will always be required.
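
To make that concrete, here is a small sketch using Python's built-in `csv` module on made-up data: every field comes back as a string, and the datetime only exists after an explicit conversion.

```python
import csv
import io
from datetime import datetime

# Made-up CSV content standing in for a file on disk.
raw = "date,value\n2018-01-05,10\n2018-12-20,20\n"

rows = list(csv.DictReader(io.StringIO(raw)))

first = rows[0]["date"]  # still a plain str at this point
converted = datetime.strptime(first, "%Y-%m-%d")  # explicit conversion step
```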

Here's a demonstration of how much slower it is to guess the format for every string than to parse with a known format:

from dateutil import parser
from datetime import datetime

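# 20,000 date strings, all in the same YYYY-MM-DD format
L = ['2018-01-05', '2018-12-20', '2018-03-30', '2018-04-15']*5000

# %timeit is an IPython/Jupyter magic, so run these lines in one of those environments
%timeit [parser.parse(i) for i in L]                   # 1.57 s  (format guessed for every string)
%timeit [datetime.strptime(i, '%Y-%m-%d') for i in L]  # 338 ms  (format known in advance)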
Author: rul30

Updated on July 14, 2022

Comments

  • rul30
    rul30 almost 2 years

    What is the difference in processing between:

    df=pd.read_csv(filename, parse_dates=[0], infer_datetime_format=True)
    

    and

    df=pd.read_csv(filename, parse_dates=[0])
    

    Why would the first import be faster, given that parse_dates already specifies where to look for a date?

  • rul30
    rul30 almost 6 years
    Does that mean that without this option pandas would check, for every row, whether it holds a date and, if so, which datetime format that row uses?
  • jpp
    jpp almost 6 years
    Not sure what you mean. But without inferring, it will try multiple formats for each row until one works. It's not efficient.
  • rul30
    rul30 almost 6 years
    Thanks for helping out here, I obviously need your help ;-) So does it mean that this option should not be used if a different date format could show up? Maybe my question should have been: what is the downside of inferring?
  • jpp
    jpp almost 6 years
    Correct. If you have more than one format, don't infer.
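
The downside discussed in the comments above can be sketched in plain Python. This is a simplified stand-in, not pandas itself: the helper `parse_assuming_first_format` and the candidate formats are hypothetical, and what pandas actually does with mixed formats depends on the version and is not shown here.

```python
from datetime import datetime


def parse_assuming_first_format(values, candidate_formats):
    """Pick the first candidate format that fits values[0], then apply it to all values."""
    for fmt in candidate_formats:
        try:
            datetime.strptime(values[0], fmt)
        except ValueError:
            continue
        # Every later value is parsed with the same format; a mismatch raises ValueError.
        return [datetime.strptime(v, fmt) for v in values]
    raise ValueError("no candidate format matches the first value")


formats = ["%Y-%m-%d", "%d/%m/%Y"]  # hypothetical candidates

# A column with one consistent format parses without trouble.
uniform = parse_assuming_first_format(["2018-01-05", "2018-12-20"], formats)

# A column that switches format part-way breaks the "same format everywhere" assumption.
try:
    parse_assuming_first_format(["2018-01-05", "20/12/2018"], formats)
    mixed_ok = True
except ValueError:
    mixed_ok = False
```

In this sketch the second list fails because the format guessed from the first value does not fit the second one, which is the situation where inferring should be avoided.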