Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`
Solution 1
The error message suggests that you change your call from
df = dd.read_csv('mylocation.csv', ...)
to
df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})
where you substitute the file location and any other arguments you were using before. If this still doesn't work, please update your question.
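To illustrate why the `dtype` override helps, here is a minimal pandas sketch (the same keyword dask forwards to pandas) using a hypothetical in-memory CSV whose ARTICLE_ID column starts out numeric-looking but later contains free text, as in the traceback above:

```python
import io
import pandas as pd

# Hypothetical data: ARTICLE_ID looks like int64 in early rows but holds
# free text later -- the situation that breaks dask's sample-based inference.
csv_data = "ARTICLE_ID,TITLE\n1001,First\n1002,Second\nJuly 2007 note,Third\n"

# Forcing 'object' keeps every value as a string, so no row can fail the
# int64 conversion that the original traceback reports.
df = pd.read_csv(io.StringIO(csv_data), dtype={"ARTICLE_ID": "object"})

print(df["ARTICLE_ID"].dtype)  # object
```

With plain pandas the whole file is read at once, so inference would have landed on `object` anyway; the explicit `dtype` matters for dask, which infers types from only a sample of the file and applies them to every partition.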
Solution 2
You can use the sample parameter of the read_csv method and assign it an integer number of bytes to use when determining dtypes. For example, I had to give it 25000000 to correctly infer the types of my data, which had shape (171907, 161).
df = dd.read_csv("game_logs.csv", sample=25000000)
https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
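The failure mode that a larger sample avoids can be sketched with chunked pandas reading, which mimics dask's partition-wise behavior: each chunk infers its own dtype, so a column that looks numeric early on gets int64 from the "sample" and then clashes with later text rows. The data below is made up for illustration:

```python
import io
import pandas as pd

# Hypothetical column: numeric in the first rows, textual further down.
csv_data = "ARTICLE_ID\n" + "\n".join(str(i) for i in range(5)) + "\nfree text row\n"

# Reading in chunks of 3 rows infers a dtype per chunk, much as dask
# infers from the first `sample` bytes and then meets conflicting data.
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=3)
dtypes = [chunk["ARTICLE_ID"].dtype for chunk in chunks]

print(dtypes)  # the first chunk infers int64, the last infers object
```

Raising `sample` so the inference window reaches the textual rows (or specifying `dtype` explicitly, as in Solution 1) removes the mismatch.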
Author: Coffey Liu
Updated on June 05, 2022

Comments
-
Coffey Liu about 2 years
I'm trying to use dask to read a csv file, and it gave me the error below. The thing is, I want my ARTICLE_ID column to be object (string). Can anyone help me read the data successfully? The traceback is:
ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.
+------------+--------+----------+
| Column     | Found  | Expected |
+------------+--------+----------+
| ARTICLE_ID | object | int64    |
+------------+--------+----------+
The following columns also raised exceptions on conversion:
ARTICLE_ID: ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)
Usually this is due to dask's dtype inference failing, and *may* be fixed by specifying dtypes manually by adding:
dtype={'ARTICLE_ID': 'object'}
to the call to `read_csv`/`read_table`.
-
Pyd over 5 years
When you have multiple files in a folder and you are reading all of them, some of the files will not have a particular column. How would you handle that case?
-
Varsha almost 4 years
Should I explicitly specify dtypes for each column? I have encountered the same problem when reading a large file with >1000 columns.
-
mdurant almost 4 years
For missing columns, I think you will need to write your own loader using dask.delayed.
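The custom-loader idea from this comment can be sketched in plain pandas: read each file, pad any missing columns to a common schema, and concatenate. (In dask you would wrap the loader in dask.delayed and pass the results to dd.from_delayed; the file contents and the `load` helper below are made up for illustration.)

```python
import io
import pandas as pd

# Two hypothetical "files"; the second is missing the ARTICLE_ID column.
files = {
    "a.csv": "ARTICLE_ID,TITLE\n1,First\n",
    "b.csv": "TITLE\nSecond\n",
}
columns = ["ARTICLE_ID", "TITLE"]

def load(text):
    # Hypothetical loader: read one file, then reindex so every frame
    # shares the same columns, filling absent ones with NaN.
    return pd.read_csv(io.StringIO(text)).reindex(columns=columns)

df = pd.concat(load(t) for t in files.values())
print(list(df.columns))  # ['ARTICLE_ID', 'TITLE']
```

The key step is `reindex(columns=...)`: each per-file frame is forced onto one schema before concatenation, so a column absent from some files never causes a mismatch.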
-
SeF over 2 years
Passing dtype='object' as an argument to read_csv also works.