Dask read_csv: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`


Solution 1

The message is suggesting that you change your call from

df = dd.read_csv('mylocation.csv', ...)

to

df = dd.read_csv('mylocation.csv', ..., dtype={'ARTICLE_ID': 'object'})

where you should change the file location and any other arguments to what you were using before. If this still doesn't work, then please update your question.
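
For reference, a minimal complete call might look like this (a sketch reusing the placeholder path from the question; add back whatever other arguments you were passing):

import dask.dataframe as dd

# Pin ARTICLE_ID to object (string) up front, so partitions whose sampled
# rows happen to look numeric are not inferred as int64.
df = dd.read_csv('mylocation.csv', dtype={'ARTICLE_ID': 'object'})
print(df.head())  # triggers a real read of the first partition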

Solution 2

You can use the sample parameter of the read_csv method and assign it an integer indicating the number of bytes to use when determining dtypes. For example, I had to set it to 25000000 for Dask to correctly infer the types of my data, which had shape (171907, 161).

df = dd.read_csv("game_logs.csv", sample=25000000)

https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_csv
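
Note that sample only affects dtype inference: if the problematic values appear later in the file than the sampled bytes, inference can still go wrong. A hedged combination (reusing the path and column name from the question) is to enlarge the sample and pin the known-problematic column anyway:

import dask.dataframe as dd

# Sample more bytes for inference, but also pin the problematic column;
# an explicit dtype always wins over inference.
df = dd.read_csv('mylocation.csv', sample=25000000,
                 dtype={'ARTICLE_ID': 'object'})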


Comments

  • Coffey Liu (about 2 years)

    I'm trying to use Dask to read a CSV file, and it gave me the error below. The thing is, I want my ARTICLE_ID to be object (string). Can anyone help me read the data successfully?

    The traceback is below:

    ValueError: Mismatched dtypes found in `pd.read_csv`/`pd.read_table`.

    +------------+--------+----------+
    | Column     | Found  | Expected |
    +------------+--------+----------+
    | ARTICLE_ID | object | int64    |
    +------------+--------+----------+

    The following columns also raised exceptions on conversion:

    ARTICLE_ID:
      ValueError("invalid literal for int() with base 10: ' July 2007 and 31 March 2008. Diagnostic practices of the medical practitioners for establishing the diagnosis of different types of EPTB were studied. Results: For the diagnosi\\\\'",)

    Usually this is due to dask's dtype inference failing, and
    *may* be fixed by specifying dtypes manually by adding:

    dtype={'ARTICLE_ID': 'object'}

    to the call to `read_csv`/`read_table`.
    
  • Pyd (over 5 years)
    When you have multiple files in a folder and you read all of them, some of the files will not have a particular column. How would you handle that case?
  • Varsha (almost 4 years)
    Should I explicitly specify dtypes for each column? I have encountered the same problem when reading a large file with >1000 columns.
  • mdurant (almost 4 years)
    For missing columns, I think you will need to write your own loader using dask.delayed (see the sketch after this list).
  • SeF (over 2 years)
    Passing dtype='object' as an argument to read_csv also works (it forces every column to be read as a string).