Pandas reading NULL as a NaN float instead of str


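Casting the column with astype(str) works for me, although the missing value then becomes the string 'nan' rather than 'NULL':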

df[3] = df[3].astype(str)

for i in df[3]:
    print (type(i), i)

<class 'str'> nan
<class 'str'> h
<class 'str'> m
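
If the original marker is wanted instead of 'nan', one option is to fill the missing entries before casting. This is only a minimal sketch: it assumes the column is labelled 3 as in the question and that every NaN in it came from a NULL in the file, and the inline CSV stands in for the real file.

```python
import pandas as pd
from io import StringIO

# Inline stand-in for test.csv from the question
temp = """a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n"""

df = pd.read_csv(StringIO(temp), names=[0, 1, 2, 3, 4])

# Replace the missing value with the literal marker, then cast to str
df[3] = df[3].fillna('NULL').astype(str)

restored = df[3].tolist()
print(restored)
```

This repairs the column after loading, so it cannot distinguish a NULL from any other default NA marker (such as an empty field) that was in the file.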

Another solution is to pass keep_default_na=False to read_csv, so that NULL is no longer treated as a missing-value marker:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from recent pandas

temp=u"""a,b,c,NULL,d
e,f,g,h,i
j,k,l,m,n"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),  names=[0,1,2,3,4], keep_default_na=False)
print (df)
   0  1  2     3  4
0  a  b  c  NULL  d
1  e  f  g     h  i
2  j  k  l     m  n

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m

If you still need missing values parsed as NaN in numeric columns, add the na_values parameter, but the marker has to be something other than NULL, e.g. NA:

import pandas as pd
from io import StringIO  # pandas.compat.StringIO is gone from recent pandas

temp=u"""a,b,c,NULL,1
e,f,g,h,2
j,k,l,m,NA"""
# after testing, replace 'StringIO(temp)' with 'filename.csv'
df = pd.read_csv(StringIO(temp),  names=[0,1,2,3,4], keep_default_na=False, na_values=['NA'])
print (df)
   0  1  2     3    4
0  a  b  c  NULL  1.0
1  e  f  g     h  2.0
2  j  k  l     m  NaN

for i in df[3]:
    print (type(i), i)
<class 'str'> NULL
<class 'str'> h
<class 'str'> m

for i in df[4]:
    print (type(i), i)
<class 'numpy.float64'> 1.0
<class 'numpy.float64'> 2.0
<class 'numpy.float64'> nan
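
na_values also accepts a dict keyed by column, which limits the NA marker to the numeric column so that a literal NA in a text column is left alone. The following is a sketch under the same assumptions as above (inline data, column labels 0 to 4); the extra NA in column 3 is added here only to show the difference.

```python
import pandas as pd
from io import StringIO

# Same layout as above, with an extra 'NA' placed in text column 3
temp = """a,b,c,NULL,1
e,f,g,NA,2
j,k,l,m,NA"""

df = pd.read_csv(
    StringIO(temp),
    names=[0, 1, 2, 3, 4],
    keep_default_na=False,    # drop the built-in markers such as NULL and NA
    na_values={4: ['NA']},    # treat 'NA' as missing only in column 4
)

text_col = df[3].tolist()
num_missing = df[4].isna().tolist()
print(text_col)
print(num_missing)
```

With a plain list, as in the example before, the NA marker applies to every column; the dict form is worth considering when a text column may legitimately contain the same token.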
Author by

alvas

食飽未?

Updated on June 21, 2022

Comments

  • alvas about 2 years

    Given the file:

    $ cat test.csv 
    a,b,c,NULL,d
    e,f,g,h,i
    j,k,l,m,n
    

    Where column 3 (the fourth field) is to be treated as str.

    When I applied a string function to the column, it failed because pandas had read the NULL string as a NaN float:

    >>> import pandas as pd
    >>> df = pd.read_csv('test.csv', names=[0,1,2,3,4], dtype={0:str, 1:str, 2:str, 3:str, 4:str})
    
    >>> df[3].apply(str.strip)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/local/lib/python3.5/site-packages/pandas/core/series.py", line 2355, in apply
        mapped = lib.map_infer(values, f, convert=convert_dtype)
      File "pandas/_libs/src/inference.pyx", line 1569, in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)
    TypeError: descriptor 'strip' requires a 'str' object but received a 'float'
    

    To verify:

    >>> for i in df[3]:
    ...    print (type(i), i)
    ... 
    <class 'float'> nan
    <class 'str'> h
    <class 'str'> m
    

    I've specified the dtype at initialization, but somehow it got overridden.

    How do I force the type of a specific column to be fixed?

    Is there a way of automatically finding these abnormal NaN floats and changing them back to the 'NULL' string?