Pandas read_csv does not load a comma separated CSV properly

17,176

Solution 1

Attention!

The main issue was downloading the data. If you run a problem of loading and processing the Kaggle Titanic Dataset, you may re-download the CSV from here and re-run your program.


You can pass delimiter=',':

df = pd.read_csv("Desktop/data/train.csv", delimiter=',')
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


print(df.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

Next, you can create a mapping of sorts:

mapping = {'male' : 0, 'female' : 1}

And you'll call pd.Series.replace:

df.Sex = df.Sex.replace(mapping)
print(df.Sex)

0    0
1    1
2    1
3    1
4    0
Name: Sex, dtype: int64

Solution 2

Your read_csv looks fine, the replace in the same line seems to be causing trouble.

Try to first read the csv as is into the variable df. This way your code will be cleaner.

df = pd.read_csv('Desktop/data/train.csv',sep=',')
df['Sex'] = df['Sex'].map( {'female': 1, 'male': 0} )

But you can leave the sep argument altogether as comma is standard delimiter

Alternatively do the cleaning with replace on a new line after you read the file into df and use inplace=True:

df['Sex'].replace({'male': 0, 'female': 1}, inplace=True)

General advice:

Kaggle webpage supports script sharing and commenting in kernel section. Try to look at it to see how you can go about the analysis if you are stuck somewhere:

https://www.kaggle.com/c/titanic/kernels

Share:
17,176

Related videos on Youtube

user8385498
Author by

user8385498

Updated on June 04, 2022

Comments

  • user8385498
    user8385498 almost 2 years

    Now,I analyze Titanic challenge of Kaggel. My code is this: code

    But my ideal output is: ideal output

    So,in my last code is

    df["Age"].fillna(df.Age.median(), inplace=True)
    

    and error happens

    ---------------------------------------------------------------------------
    KeyError                                  Traceback (most recent call last)
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
       2133             try:
    -> 2134                 return self._engine.get_loc(key)
       2135             except KeyError:
    
    pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()
    
    pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()
    
    pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()
    
    pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
    
    KeyError: 'Age'
    
    During handling of the above exception, another exception occurred:
    
    KeyError                                  Traceback (most recent call last)
    <ipython-input-4-9763f0a9951c> in <module>()
    ----> 1 df["Age"].fillna(df.Age.median(), inplace=True)
    
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
       2057             return self._getitem_multilevel(key)
       2058         else:
    -> 2059             return self._getitem_column(key)
       2060 
       2061     def _getitem_column(self, key):
    
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
       2064         # get column
       2065         if self.columns.is_unique:
    -> 2066             return self._get_item_cache(key)
       2067 
       2068         # duplicate columns & possible reduce dimensionality
    
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
      1384         res = cache.get(item)
       1385         if res is None:
    -> 1386             values = self._data.get(item)
       1387             res = self._box_item_values(item, values)
       1388             cache[item] = res
    
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
       3541 
       3542             if not isnull(item):
    -> 3543                 loc = self.items.get_loc(item)
       3544             else:
       3545                 indexer = np.arange(len(self.items))[isnull(self.items)]
    
    /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
       2134                 return self._engine.get_loc(key)
       2135             except KeyError:
    -> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
       2137 
       2138         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
    
    pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()
    
    pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()
    
    pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()
    
    pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()
    
    KeyError: 'Age'
    

    I use sep=',' so I really cannot understand why this code cannot separate in each comma.How can I fix this?

    I followed one answer,but error happens (I do not know why) error

    My data is data

  • user8385498
    user8385498 over 6 years
    thx, ur comments.Your means, df= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) ,right?
  • user8385498
    user8385498 over 6 years
    But using this code,I got an error, TypeError Traceback (most recent call last) <ipython-input-7-670579237ebd> in <module>() ----> 1 df= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) TypeError: parser_f() got an unexpected keyword argument 'inplace' , do i misunderstand ur message?
  • StefanK
    StefanK over 6 years
    I edited my original answer... do the cleaning on new line
  • user8385498
    user8385498 over 6 years
    Ur message means
  • user8385498
    user8385498 over 6 years
    df= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) df.cleaning(inplace=True, regex=True)
  • StefanK
    StefanK over 6 years
    Try just the two lines as presented. Or what coldspeed advices
  • StefanK
    StefanK over 6 years
    According to pandas documentation delimiter is just alternative name for sep: pandas.pydata.org/pandas-docs/stable/generated/…
  • cs95
    cs95 over 6 years
    @StefanK From my experience, there have been cases where I got things working by using a combination of them, or changing one to the other. I think they're complements for each other, not replacements. Edit: Changed my answer slightly.
  • user8385498
    user8385498 over 6 years
    @cᴏʟᴅsᴘᴇᴇᴅ Thx for ur answer.I followed ur message, but AttributeError happen.I updated my question, if u know something, please help me.
  • cs95
    cs95 over 6 years
    @user8385498 Run all my code dude. You only followed the mapping part.
  • user8385498
    user8385498 over 6 years
    I followed ur codes precisely, but same error happens(I found one mistake before codes)I updated my question, if u found some mistake,please tell me.
  • cs95
    cs95 over 6 years
    @user8385498 please use df = pd.read_csv("Desktop/data/train.csv", delimiter=','), not sep=','. If there are any more problems, run pip install --upgrade pandas and try again.
  • user8385498
    user8385498 over 6 years
    I cannot believe it!!!I run pip install --upgrade pandas, and tried it again, but same error happen.Is there a problem in my csv file?I updated it.
  • cs95
    cs95 over 6 years
    @user8385498 Go to this link: storage.googleapis.com/kaggle-competitions-data/kaggle/3136/‌​… and download again.
  • cs95
    cs95 over 6 years
    @user8385498 Lmao, congratulations.