Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`


Solution 1

The message is suggesting that you change your call from

df = dd.read_csv('mylocation.csv', ...)

to

df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})

where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.
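
For reference, a minimal complete call might look like this (a sketch reusing the placeholder path from the question; add back whatever other arguments you were passing):

import dask.dataframe as dd

# Pin ARTICLE_ID to object (string) up front, so partitions whose sampled
# rows happen to look numeric are not inferred as int64.
df = dd.read_csv('mylocation.csv', dtype={'ARTICLE_ID': 'object'})
print(df.head())  # triggers a real read of the first partition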

Solution 2

You can use the sample parameter of the read_csv method and assign it an integer indicating the number of bytes to use when determining dtypes. For example, I had to set it to 25000000 for Dask to correctly infer the types of my data, which had shape (171907, 161).

df = dd.read_csv("game_logs.csv", sample=25000000)

https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
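
Note that sample only affects dtype inference: if the problematic values appear later in the file than the sampled bytes, inference can still go wrong. A hedged combination (reusing the path and column name from the question) is to enlarge the sample and pin the known-problematic column anyway:

import dask.dataframe as dd

# Sample more bytes for inference, but also pin the problematic column;
# an explicit dtype always wins over inference.
df = dd.read_csv('mylocation.csv', sample=25000000,
                 dtype={'ARTICLE_ID': 'object'})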


Comments

  • Coffey Liu (about 2 years)

    I'm trying to use Dask to read a CSV file, and it gave me the error below. The thing is, I want my ARTICLE_ID to be object (string). Can anyone help me read the data successfully?

    The traceback is below:

    ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

    +------------+--------+----------+
    | Column     | Found  | Expected |
    +------------+--------+----------+
    | ARTICLE_ID | object | int64    |
    +------------+--------+----------+

    The following columns also raised exceptions on conversion:

    ARTICLE_ID:
      ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)

    Usually this is due to dask's dtype inference failing, and
    *may* be fixed by specifying dtypes manually by adding:

    dtype={'ARTICLE_ID': 'object'}

    to the call to `read_csv`/`read_table`.
    
  • Pyd (over 5 years)
    When you have multiple files in a folder and you read all of them, some of the files will not have a particular column. How would you handle that case?
  • Varsha (almost 4 years)
    Should I explicitly specify dtypes for each column? I have encountered the same problem when reading a large file with >1000 columns.
  • mdurant (almost 4 years)
    For missing columns, I think you will need to write your own loader using dask.delayed (see the sketch after this list).
  • SeF (over 2 years)
    Passing dtype='object' as an argument to read_csv also works (it forces every column to be read as a string).