Pandas read_csv does not load a comma separated CSV properly
Solution 1
Attention!
The main issue was downloading the data. If you run a problem of loading and processing the Kaggle Titanic Dataset, you may re-download the CSV from here and re-run your program.
You can pass delimiter=','
:
df = pd.read_csv("Desktop/data/train.csv", delimiter=',')
print(df.head())
PassengerId Survived Pclass \
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp \
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
print(df.columns)
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
dtype='object')
Next, you can create a mapping of sorts:
mapping = {'male' : 0, 'female' : 1}
And you'll call pd.Series.replace
:
df.Sex = df.Sex.replace(mapping)
print(df.Sex)
0 0
1 1
2 1
3 1
4 0
Name: Sex, dtype: int64
Solution 2
Your read_csv looks fine, the replace in the same line seems to be causing trouble.
Try to first read the csv as is into the variable df. This way your code will be cleaner.
df = pd.read_csv('Desktop/data/train.csv',sep=',')
df['Sex'] = df['Sex'].map( {'female': 1, 'male': 0} )
But you can leave the sep argument altogether as comma is standard delimiter
Alternatively do the cleaning with replace on a new line after you read the file into df and use
inplace=True
:
df['Sex'].replace({'male': 0, 'female': 1}, inplace=True)
General advice:
Kaggle webpage supports script sharing and commenting in kernel section. Try to look at it to see how you can go about the analysis if you are stuck somewhere:
https://www.kaggle.com/c/titanic/kernels
Related videos on Youtube
user8385498
Updated on June 04, 2022Comments
-
user8385498 almost 2 years
Now,I analyze Titanic challenge of Kaggel. My code is this:
But my ideal output is:
So,in my last code is
df["Age"].fillna(df.Age.median(), inplace=True)
and error happens
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 2133 try: -> 2134 return self._engine.get_loc(key) 2135 except KeyError: pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)() pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)() pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)() KeyError: 'Age' During handling of the above exception, another exception occurred: KeyError Traceback (most recent call last) <ipython-input-4-9763f0a9951c> in <module>() ----> 1 df["Age"].fillna(df.Age.median(), inplace=True) /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key) 2057 return self._getitem_multilevel(key) 2058 else: -> 2059 return self._getitem_column(key) 2060 2061 def _getitem_column(self, key): /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key) 2064 # get column 2065 if self.columns.is_unique: -> 2066 return self._get_item_cache(key) 2067 2068 # duplicate columns & possible reduce dimensionality /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item) 1384 res = cache.get(item) 1385 if res is None: -> 1386 values = self._data.get(item) 1387 res = self._box_item_values(item, values) 1388 cache[item] = res /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath) 3541 3542 if not isnull(item): -> 3543 loc = self.items.get_loc(item) 3544 else: 3545 indexer = np.arange(len(self.items))[isnull(self.items)] /Users/XXXi/anaconda/envs/py36/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance) 2134 return self._engine.get_loc(key) 2135 except KeyError: -> 2136 return self._engine.get_loc(self._maybe_cast_indexer(key)) 2137 2138 indexer = self.get_indexer([key], method=method, tolerance=tolerance) pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)() pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)() pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)() pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)() KeyError: 'Age'
I use
sep=','
so I really cannot understand why this code cannot separate in each comma.How can I fix this? -
user8385498 over 6 yearsthx, ur comments.Your means, df= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) ,right?
-
user8385498 over 6 yearsBut using this code,I got an error, TypeError Traceback (most recent call last) <ipython-input-7-670579237ebd> in <module>() ----> 1 df= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) TypeError: parser_f() got an unexpected keyword argument 'inplace' , do i misunderstand ur message?
-
StefanK over 6 yearsI edited my original answer... do the cleaning on new line
-
user8385498 over 6 yearsUr message means
-
user8385498 over 6 yearsdf= pd.read_csv("Desktop/data/train.csv",sep=',',inplace=True, regex=True).replace("male",0).replace("female",1) df.cleaning(inplace=True, regex=True)
-
StefanK over 6 yearsTry just the two lines as presented. Or what coldspeed advices
-
StefanK over 6 yearsAccording to pandas documentation
delimiter
is just alternative name for sep: pandas.pydata.org/pandas-docs/stable/generated/… -
cs95 over 6 years@StefanK From my experience, there have been cases where I got things working by using a combination of them, or changing one to the other. I think they're complements for each other, not replacements. Edit: Changed my answer slightly.
-
user8385498 over 6 years@cᴏʟᴅsᴘᴇᴇᴅ Thx for ur answer.I followed ur message, but AttributeError happen.I updated my question, if u know something, please help me.
-
cs95 over 6 years@user8385498 Run all my code dude. You only followed the mapping part.
-
user8385498 over 6 yearsI followed ur codes precisely, but same error happens(I found one mistake before codes)I updated my question, if u found some mistake,please tell me.
-
cs95 over 6 years@user8385498 please use
df = pd.read_csv("Desktop/data/train.csv", delimiter=',')
, notsep=','
. If there are any more problems, runpip install --upgrade pandas
and try again. -
user8385498 over 6 yearsI cannot believe it!!!I run pip install --upgrade pandas, and tried it again, but same error happen.Is there a problem in my csv file?I updated it.
-
cs95 over 6 years@user8385498 Go to this link: storage.googleapis.com/kaggle-competitions-data/kaggle/3136/… and download again.
-
cs95 over 6 years@user8385498 Lmao, congratulations.