Leave dates as strings using read_excel function from pandas in python

16,464

Solution 1

I ran into an identical problem, except pandas was oddly converting only some cells into datetimes. I ended up manually converting each cell into a string like so:

def undate(x):
    if pd.isnull(x):
        return x
    try:
        return x.strftime('%d/%m/%Y')
    except AttributeError:
        return x
    except Exception:
        raise

for i in list_of_possible_date_columns:
    df[i] = df[i].apply(undate)

Solution 2

  • Using converters{'Date': str} option inside the pandas.read_excel which helps. pandas.read_excel(xlsx, sheet, converters={'Date': str})
  • you can try convert your timestamp back to the original format
    df['Date'][0].strftime('%Y/%m/%d')

Solution 3

I tried saving the file in a CSV UTF-8 format (manually) and used pd.read_csv() and worked fine.

I tried a bunch of things to figure the same thing with read_excel. Did not work anything for me. So, I am guessing read_excel is probably updating your string in a datetime object which you can not control.

Share:
16,464

Related videos on Youtube

MattB
Author by

MattB

Web Analytics Java / Javascript coder.

Updated on September 09, 2022

Comments

  • MattB
    MattB over 1 year

    Python 2.7.10
    Tried pandas 0.17.1 -- function read_excel
    Tried pyexcel 0.1.7 + pyexcel-xlsx 0.0.7 -- function get_records()

    When using pandas in Python is it possible to read excel files (formats: xls, xlsx) and leave columns containing date or date + time values as strings rather than auto-converting to datetime.datetime or timestamp types?

    If this is not possible using pandas can someone suggest an alternate method/library to read xls, xlsx files and leave date column values as strings?

    For the pandas solution attempts the df.info() and resulting date column types are shown below:

    >>> df.info()
    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 117 entries, 0 to 116
    Columns: 176 entries, Mine to Index
    dtypes: datetime64[ns](2), float64(145), int64(26), object(3)
    memory usage: 161.8+ KB
    >>> type(df['Start Date'][0])
    Out[6]: pandas.tslib.Timestamp
    >>> type(df['End Date'][0])
    Out[7]: pandas.tslib.Timestamp
    

    Attempt/Approach 1:

    def read_as_dataframe(filename, ext):
       import pandas as pd
       if ext in ('xls', 'xlsx'):
          # problem: date columns auto converted to datetime.datetime or timestamp!
          df = pd.read_excel(filename) # unwanted - date columns converted!
    
       return df, name, ext
    

    Attempt/Approach 2:

    import pandas as pd
    # import datetime as datetime
    # parse_date = lambda x: datetime.strptime(x, '%Y%m%d %H')
    parse_date = lambda x: x
    elif ext in ('xls', 'xlsx', ):
        df = pd.read_excel(filename, parse_dates=False)
        date_cols = [df.columns.get_loc(c) for c in df.columns if c in ('Start Date', 'End Date')]
        # problem: date columns auto converted to datetime.datetime or timestamp!
        df = pd.read_excel(filename, parse_dates=date_cols, date_parser=parse_date)
    

    And have also tried pyexcel library but it does the same auto-magic convert behavior:

    Attempt/Approach 3:

    import pyexcel as pe
    import pyexcel.ext.xls
    import pyexcel.ext.xlsx
    
    t0 = time.time()
    if ext == 'xlsx':
        records = pe.get_records(file_name=filename)
        for record in records:
            print("start date = %s (type=%s), end date = %s (type=%s)" %
                  (record['Start Date'],
                   str(type(record['Start Date'])),
                   record['End Date'],
                   str(type(record['End Date'])))
                  )
    
    • MattB
      MattB over 8 years
      As you can see from my code snippets I am not passing date_cols or date_parser for first style approach above and pandas is auto-magically (and silently) converting the columns to datetime values. I need to find a way to stop that and leave any column containing a date value as a string.
    • MattB
      MattB over 8 years
      Thanks @EdChum will explore csv approach further. The spreadsheets themselves are out of my control (format + content) so I need to work with formats provided to me and I am trying to develop a robust custom method for trying different date formats (time zones) and presenting user with best fit.
    • joris
      joris over 8 years
      I think @EdChum is correct. The type you get depends on how the data is typed in the excel file itself. If the column is formatted as datetime data there, you will get datetime data from read_excel.
  • Amon
    Amon about 4 years
    I get why he didn't accept this as the answer but this works for me and is really easy. I don't think there's another method that pandas allows which is weird to me
  • FObersteiner
    FObersteiner about 2 years
    to make this a useful answer, could you please add some more details? for which specific input (excel file) does this give what desired output?