Setting column types while reading csv with pandas

53,155

Solution 1

In your loop you are doing:

for col in dp.columns:
    print 'column', col,':', type(col[0])

and you are correctly seeing str as the output everywhere because col[0] is the first letter of the name of the column, which is a string.

For example, if you run this loop:

for col in dp.columns:
    print 'column', col,':', col[0]

you will see the first letter of each column name printed out; that is what col[0] is.

Your loop iterates only over the column names, not over the data in each Series.

What you really want is to check the type of each column's data (not its header or part of its header) in a loop.

So do this instead to get the types of the column data (non-header data):

for col in dp.columns:
    print 'column', col,':', type(dp[col][0])

This is similar to what you did when printing the type of the rating column separately.
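To make the difference concrete, here is a minimal sketch. It uses Python 3 print syntax rather than the Python 2 statements above, and the small DataFrame is a hypothetical stand-in for products.csv, not the asker's data. It also uses .iloc[0] to take the first row by position, which is an assumption on my part to avoid depending on the index having a label 0.

```python
import pandas as pd

# Hypothetical stand-in for products.csv, reusing three of its column names.
dp = pd.DataFrame({
    "name": ["bib", "blanket"],
    "review": ["great", "soft"],
    "rating": [5, 4],
})

# Iterating over dp.columns yields the labels, so col[0] is a single character.
label_types = {col: type(col[0]).__name__ for col in dp.columns}

# Indexing the frame with the label yields the Series; .iloc[0] is its first value.
value_types = {col: type(dp[col].iloc[0]).__name__ for col in dp.columns}

print(label_types)
print(value_types)
```

The first dictionary reports str for every column, which matches the confusing output in the question. The second reports the type of the stored values, so the rating column shows a NumPy integer type.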

Solution 2

Use:

dp.info()

to see the datatypes of the columns. dp.columns refers to the column header names, which are strings.
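As a rough illustration (Python 3 syntax, with a made-up one-row DataFrame whose price column is not in the original data), dp.dtypes gives the same per-column information as a Series that you can inspect or loop over:

```python
import pandas as pd

# Hypothetical sample; only the column names "name" and "rating" come from the question.
dp = pd.DataFrame({"name": ["bib"], "rating": [5], "price": [9.5]})

# dp.dtypes is a Series mapping each column label to its dtype.
print(dp.dtypes)

# dp.info() prints the dtypes together with non-null counts and memory usage.
dp.info()

# Looping over dtypes avoids the col[0] pitfall entirely.
for col, dtype in dp.dtypes.items():
    print("column", col, ":", dtype)
```

dp.info() writes its report directly to standard output, while dp.dtypes is the better choice when you want to use the result in code.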

Solution 3

I think you should check this one first: Pandas: change data type of columns

If you google "pandas dataframe column type", it is among the top five results.
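For reference, here is a sketch of both routes, written in Python 3: setting types while reading, and converting after the fact. The CSV text is an invented two-row sample in the shape of products.csv, and using ast.literal_eval to parse the word_count strings is my assumption about how those cells are formatted. It only works if each cell is a valid Python literal.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row sample in the shape of products.csv.
csv_text = '''name,review,rating,word_count
bib,great bib,5,"{'great': 1, 'bib': 1}"
blanket,so soft,4,"{'so': 1, 'soft': 1}"
'''

# Option A: set types while reading. dtype covers scalar types;
# converters calls a function on the raw string of each cell in a column.
dp = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": str, "review": str, "rating": int},
    converters={"word_count": ast.literal_eval},
)

# Option B: read every column as text, then convert afterwards.
raw = pd.read_csv(io.StringIO(csv_text), dtype=str)
raw["rating"] = raw["rating"].astype(int)
raw["word_count"] = raw["word_count"].apply(ast.literal_eval)

print(type(dp["word_count"].iloc[0]))
print(dp["rating"].dtype)
```

Either way the word_count column ends up with object dtype holding dict values, since pandas has no dedicated dict dtype. Note that ast.literal_eval raises an error on empty or malformed cells, so real data may need a small wrapper function that handles those cases.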

Author by

user2738815

Updated on June 22, 2020

Comments

  • user2738815
    user2738815 almost 4 years

    I am trying to read a CSV file into a pandas DataFrame with the following dtype settings:

    dp = pd.read_csv('products.csv', header = 0,  dtype = {'name': str,'review': str,
                                                          'rating': int,'word_count': dict}, engine = 'c')
    print dp.shape
    for col in dp.columns:
        print 'column', col,':', type(col[0])
    print type(dp['rating'][0])
    dp.head(3)
    

    This is the output:

    (183531, 4)
    column name : <type 'str'>
    column review : <type 'str'>
    column rating : <type 'str'>
    column word_count : <type 'str'>
    <type 'numpy.int64'>
    


    I can sort of understand that pandas might be finding it difficult to convert a string representation of a dictionary into a dictionary given this and this. But how can the content of the "rating" column be both str and numpy.int64???

    By the way, tweaks like not specifying an engine or header do not change anything.

    Thanks and regards

  • user2738815
    user2738815 about 8 years
    Thanks, that was a slip on my part :) I am choosing this as the accepted answer because it is a direct response to my question.
  • user2738815
    user2738815 about 8 years
    Another shortcut I missed in the very dense pandas documentation--thank you.
  • user2738815
    user2738815 about 8 years
    Thank you, that is useful. I wish there were also a discussion of how to force conversion into dict type, as well (if there is one).
  • Colonel Beauvel
    Colonel Beauvel about 8 years
    I guess it was a typo, sometimes hard to detect when focused on the code ;)
  • lb_so
    lb_so over 3 years
    I don't think that is an answer to this question though - the question requires setting a column type during the read_csv process. Doing it post fact may be highly undesirable in a given use case. Good link though.