Pandas read_csv dtype specify all columns but one


Solution 1

You can first filter the columns whose names contain 'X' with boolean indexing and then replace the string values:

cols = df.columns[df.columns.str.contains('X')]
df[cols] = df[cols].replace({'True': True, 'False': False})

Or if you need to filter on the exact column name 'X':

cols = df.columns[df.columns == 'X']
df[cols] = df[cols].replace({'True': True, 'False': False})

Sample:

import pandas as pd

df = pd.DataFrame({'A':['a1','a2','a3'],
                   'B':['b1','b2','b3'],
                   'C':['c1','c2','c3'],
                   'X':['True','False','True']})

print (df)
    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
print (df.dtypes)
A    object
B    object
C    object
X    object
dtype: object

cols = df.columns[df.columns.str.contains('X')]
print (cols)

Index(['X'], dtype='object')

df[cols] = df[cols].replace({'True': True, 'False': False})

print (df.dtypes)
A    object
B    object
C    object
X      bool
dtype: object
print (df)

    A   B   C      X
0  a1  b1  c1   True
1  a2  b2  c2  False
2  a3  b3  c3   True
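Putting Solution 1 together with the original `read_csv` step, here is a minimal end-to-end sketch. It assumes a semicolon-separated file shaped like the sample in the question below; `csv_text` stands in for the real file:

```python
import io

import pandas as pd

# Inline stand-in for the question's semicolon-separated sample file.
csv_text = "A;B;C;X\na1;b1;c1;True\na2;b2;c2;False\na3;b3;c3;True\n"

# Read every column as string first, then convert the optional 'X' column.
data = pd.read_csv(io.StringIO(csv_text), sep=';', dtype=str)
if 'X' in data.columns:
    data['X'] = data['X'].replace({'True': True, 'False': False})

print(data)
print(data.dtypes)
```

If the file has no 'X' column, the `if` guard simply skips the conversion and everything stays `str`.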

Solution 2

Why not use the built-in bool() type? bool(x) evaluates to True for any argument except falsy values such as False, None, '', and 0.

if 'X' in data.columns:
    try:
        # replace the string 'False' with 0 so bool() maps it to False;
        # note that bool() turns any other non-empty string into True
        data['X'] = data['X'].replace('False', 0).apply(bool)
    except Exception:
        data['X'] = None
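A quick demonstration of the truthiness rule this solution relies on, and of its pitfall (raised in the comments below): bool() looks only at emptiness, not at the word the string spells, so the string 'False' is truthy.

```python
# Any non-empty string is truthy, so the string 'False'
# does NOT become the boolean False:
print(bool('False'))  # True  -- non-empty string
print(bool(''))       # False -- empty string
print(bool(0))        # False -- zero

# That is why the string must be mapped to 0 (or compared explicitly)
# before calling bool(), as the replace('False', 0) step attempts.
```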
Author: elaspog

Updated on July 20, 2022

Comments

  • elaspog, almost 2 years ago

    I have a CSV file. Most of its values I want to read as strings, but I want to read one column as bool if a column with the given title exists.

    Because the CSV file has a lot of columns, I don't want to specify the data type of each column directly, like this:

    data = read_csv('sample.csv', dtype={'A': str, 'B': str, ..., 'X': bool})
    

    Is it possible to define the string type on each column but one and read an optional column as a bool at the same time?

    My current solution is the following (but it's very inefficient and slow):

    data = read_csv('sample.csv', dtype=str) # reads all column as string
    if 'X' in data.columns:
        l = lambda row: True if row['X'] == 'True' else False if row['X'] == 'False' else None
        data['X'] = data.apply(l, axis=1)
    

    UPDATE: Sample CSV:

    A;B;C;X
    a1;b1;c1;True
    a2;b2;c2;False
    a3;b3;c3;True
    

    Or the same file can come without the 'X' column (because the column is optional):

    A;B;C
    a1;b1;c1
    a2;b2;c2
    a3;b3;c3
    
  • elaspog, almost 8 years ago
    I think applying the lambda expression slows down processing time.
  • elaspog, almost 8 years ago
    Thank you! It's much faster than my solution!
  • TheLazyScripter, almost 8 years ago
    Yes, most definitely. If I were going to approach it the way you did originally, I would use an if/else statement: if data == 'False': l = False elif data == 'True': l = True else: l = None. But I would use one of the answers that were provided. Assuming you have a large dataset, I would time both solutions and use the fastest. Good luck!
  • elaspog, almost 8 years ago
    That solution has a very big problem: it turns False values into True values. Any non-empty string is parsed as True.
  • elaspog, almost 8 years ago
    I wouldn't need it by default, but my data is very special (it mixes alphanumeric values). That's one of the reasons I have this restriction on some columns: to keep special numeric values in string format.
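The conversion discussed above can also happen during parsing: `read_csv` accepts a `converters` mapping that applies a function per column while reading, and converter keys for columns absent from the file are simply ignored, which suits the optional 'X' column. A sketch, again using an inline `csv_text` stand-in for the real file (note that pandas emits a ParserWarning when both `dtype` and a converter name the same column, and uses the converter):

```python
import io

import pandas as pd

csv_text = "A;B;C;X\na1;b1;c1;True\na2;b2;c2;False\na3;b3;c3;True\n"


def to_bool(value):
    # Illustrative helper: map the literal strings, anything else -> None.
    return {'True': True, 'False': False}.get(value)


# dtype=str covers every column; the converter takes over for 'X' only.
data = pd.read_csv(io.StringIO(csv_text), sep=';', dtype=str,
                   converters={'X': to_bool})

print(data['X'].tolist())
```

If the file lacks the 'X' column, the same call succeeds and all remaining columns stay `str`.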