Can pandas automatically read dates from a CSV file?

301,679

Solution 1

You should add parse_dates=True, or parse_dates=['column name'] when reading, thats usually enough to magically parse it. But there are always weird formats which need to be defined manually. In such a case you can also add a date parser function, which is the most flexible way possible.

Suppose you have a column 'datetime' with your string, then:

from datetime import datetime
dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates=['datetime'], date_parser=dateparse)

This way you can even combine multiple columns into a single datetime column, this merges a 'date' and a 'time' column into a single 'datetime' column:

dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

You can find directives (i.e. the letters to be used for different formats) for strptime and strftime in this page.

Solution 2

Perhaps the pandas interface has changed since @Rutger answered, but in the version I'm using (0.15.2), the date_parser function receives a list of dates instead of a single value. In this case, his code should be updated like so:

from datetime import datetime
import pandas as pd

dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d %H:%M:%S') for d in dates]
    
df = pd.read_csv('test.dat', parse_dates=['datetime'], date_parser=dateparse)

Since the original question asker said he wants dates and the dates are in 2013-6-4 format, the dateparse function should really be:

dateparse = lambda dates: [datetime.strptime(d, '%Y-%m-%d').date() for d in dates]

Solution 3

You could use pandas.to_datetime() as recommended in the documentation for pandas.read_csv():

If a column or index contains an unparseable date, the entire column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

Demo:

>>> D = {'date': '2013-6-4'}
>>> df = pd.DataFrame(D, index=[0])
>>> df
       date
0  2013-6-4
>>> df.dtypes
date    object
dtype: object
>>> df['date'] = pd.to_datetime(df.date, format='%Y-%m-%d')
>>> df
        date
0 2013-06-04
>>> df.dtypes
date    datetime64[ns]
dtype: object

Solution 4

When merging two columns into a single datetime column, the accepted answer generates an error (pandas version 0.20.3), since the columns are sent to the date_parser function separately.

The following works:

def dateparse(d,t):
    dt = d + " " + t
    return pd.datetime.strptime(dt, '%d/%m/%Y %H:%M:%S')

df = pd.read_csv(infile, parse_dates={'datetime': ['date', 'time']}, date_parser=dateparse)

Solution 5

pandas read_csv method is great for parsing dates. Complete documentation at http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

you can even have the different date parts in different columns and pass the parameter:

parse_dates : boolean, list of ints or names, list of lists, or dict
If True -> try parsing the index. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a
separate date column. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date
column. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

The default sensing of dates works great, but it seems to be biased towards north american Date formats. If you live elsewhere you might occasionally be caught by the results. As far as I can remember 1/6/2000 means 6 January in the USA as opposed to 1 Jun where I live. It is smart enough to swing them around if dates like 23/6/2000 are used. Probably safer to stay with YYYYMMDD variations of date though. Apologies to pandas developers,here but i have not tested it with local dates recently.

you can use the date_parser parameter to pass a function to convert your format.

date_parser : function
Function to use for converting a sequence of string columns to an array of datetime
instances. The default uses dateutil.parser.parser to do the conversion.
Share:
301,679

Related videos on Youtube

Roman
Author by

Roman

Updated on July 08, 2022

Comments

  • Roman
    Roman almost 2 years

    Today I was positively surprised by the fact that while reading data from a data file (for example) pandas is able to recognize types of values:

    df = pandas.read_csv('test.dat', delimiter=r"\s+", names=['col1','col2','col3'])
    

    For example it can be checked in this way:

    for i, r in df.iterrows():
        print type(r['col1']), type(r['col2']), type(r['col3'])
    

    In particular integer, floats and strings were recognized correctly. However, I have a column that has dates in the following format: 2013-6-4. These dates were recognized as strings (not as python date-objects). Is there a way to "learn" pandas to recognized dates?

  • Arya McCarthy
    Arya McCarthy about 7 years
    I think you misunderstand the question. The user is curious whether the option could be enabled for his format of string.
  • Jean Paul
    Jean Paul about 7 years
    Did not work for me, I got the following error: TypeError: strptime() argument 1 must be str, not float
  • Jean Paul
    Jean Paul about 7 years
    I got this error because there were nan in my data frame.
  • Gaurav
    Gaurav almost 7 years
    @AryaMcCarthy umm, he basically wants the date to be recognized correctly, so I am mentioning how can he transform the source data so that it is naturally recognized by pandas. Nowhere does he mention he cannot change the format of the source data.
  • Amir
    Amir over 6 years
    can you add an item that also NaTs the non-parsable material or NaN or /Ns. cause it seems this parser totally skips the whole column if anything like that is present
  • Will Gordon
    Will Gordon over 6 years
    You can specify dayfirst as True for European/international dates. pandas.pydata.org/pandas-docs/stable/generated/…
  • Dai
    Dai about 6 years
    I'm using pandas 0.22 and agree that the accepted answer no longer works.
  • Winand
    Winand over 5 years
    There's an option infer_datetime_format: "pandas will attempt to infer the format of the datetime strings in the columns". This can be used instead of date_parser.
  • IceQueeny
    IceQueeny over 5 years
    This creates a "TypeError: can only concatenate str (not "float") to str" for me. Date column is d/m/y and time column is H:M:00
  • Mr_and_Mrs_D
    Mr_and_Mrs_D over 5 years
    Note that if your dates are in ISO 8601 format you should not pass infer_datetime_format or a parser function - it's much slower than letting pandas handle it (especially the latter). The dateformat in this answer falls into this category also
  • lstodd
    lstodd over 5 years
    Is a lambda function necessary here? I think we can just directly apply the function to the date column.
  • Rutger Kassies
    Rutger Kassies over 5 years
    As mentioned in the post, it's necessary if you want to deal with a specific date format. Not specifying anything makes assumptions about the format you provide.
  • BobbyF
    BobbyF over 5 years
    I am using the same format for parsing my date column (only difference is that I use ''%Y-%m-%d'). It does the parsing and reads the csv into dataframe. But still my column is of type string instead of datetime. What am I missing? \
  • BobbyF
    BobbyF over 5 years
    I stopped the environment and restarted and it worked!!
  • ifly6
    ifly6 about 5 years
    Adding parse_dates=True doesn't work, where 'date' is the column and the dates are stored as YYYY-MM-DD.
  • royalyadnesh
    royalyadnesh over 4 years
    Getting this error: dateparse takes 1 positional argument but 2 were given, I am trying to do it for multiple columns.
  • Septacle
    Septacle about 4 years
    @royalyadnesh You need something like dateparse = lambda x,y : ...
  • ratnesh
    ratnesh about 4 years
    it is converting other columns to date too, which are of object type
  • Ébe Isaac
    Ébe Isaac almost 4 years
    pd.datetime is currently deprecated, replace pd.datetime with just datetime after import datetime from datetime.
  • cglacet
    cglacet over 3 years
    Maybe add a remark on dtype, the type you need to specify for read_csv argument is the input type and not the output type. So its datetime={'datetime': 'string'} and not datetime={'datetime': 'datetime64'}. If you do that the output type will be correct (datetime64[ns]).
  • saga
    saga over 3 years
    Why need to use a function instead of date_parser=datetime.strptime(x, '%Y-%m-%d %H:%M:%S') directly?
  • Raman Joshi
    Raman Joshi almost 3 years
    Can I use multiple date format check with or condition?
  • Raman Joshi
    Raman Joshi almost 3 years
    What if I don't know the column name when uploading csv file. I want some generic solution to auto detect date column when uploading csv file with some pre-defined date formats.
  • Cam
    Cam almost 3 years
    Have you tried the infer_datetime_format parameter for read_csv. It infers the dates which is fine if your dates are consistently formatted. It speeds up the process.
  • VLRoyrenn
    VLRoyrenn almost 3 years
    I would have, but my dates were in a weird format with the month first and the milliseconds part being separated by a third colon instead of a dot. At any rate, I figured it would still be useful to have this method written somewhere, since most sources either use inference or date_parser, with no fast alternative to date_format being given. IIRC the doc is also vague on how multiple date columns are combined when passed as an array to parse_dates (they get separated by spaces).
  • Eric Duminil
    Eric Duminil over 2 years
    It works fine, thanks. Note : dateparse = lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S'). Lambdas are anonymous functions. If you name an anonymous function, you might as well just use a function. So either pass a lambda directly as argument to read_csv, or define a dateparse function the usual way.