Reading large csv files with strings containing commas as one field

15,383

Solution 1

I'm not sure what is generating your CSV file but that is your problem.

The point of a CSV file, is that the file itself designates separation of fields. If the text of the CSV contains commas, then nothing you can do will help you. How would ANY program know when the text in a single field contains commas, or when that comma is a field delimiter?

Proper CSV would have a text qualifier. Some generators/readers gives you the option to use one. The standard text qualifier is a " (quote). Its changeable, though, because your text may contain those, too.

Again, its all about generating proper CSV content.

Solution 2

There's a chance that xlsread won't give you the answer you expect -- do the strings always appear in the same columns, for example? I think (as everyone else seems to :-) that it would be more robust to just use

fid = fopen('yourfile.csv');

and then either textscan

t = textscan(fid, '%s', delimiter', sprintf('\n'));
t = t{1};

or just fgetl (the example in the help is perfect).

After that you can do some line-by-line processing -- using textscan again on the text content of each line, for example, is a nice, quick way to get a cell-array that will allow fast analysis of each line.

Share:
15,383
K_U
Author by

K_U

Updated on June 05, 2022

Comments

  • K_U
    K_U almost 2 years

    I have a large .csv file (~26000 rows). I want to be able to read it into matlab. Another problem is that it contains a collection of strings delimited by commas in one of the fields.

    I'm having trouble reading it. I tried stuff like tdfread, which won't work here. Any tricks with textscan i should be aware about?

    Is there any other way?

  • DoctorLouie
    DoctorLouie over 14 years
    This wont help for excel will treat all rows as comma delimited values, and in essence will come up with extra columns.
  • tpow
    tpow over 14 years
    Actually, if you have it in an .xls you can have commas within your cells. At this point, you can run your functions to extract the commas, and then save as a .csv
  • K_U
    K_U over 14 years
    The CSV file is provided to me, I have no control over how it is generated. Good point. I have heard that it's possible to design a context-based lexical analyzer to run through the file and change the comma's(separating the collection of strings) to another character.
  • K_U
    K_U over 14 years
    I managed to read the file using xlsread in matlab. I used the option where i can get the function to give me numeric, text and raw data in different matrices.
  • K_U
    K_U over 14 years
    I tried converting into xls, but the number of rows far exceeds the max limit. It's actually ~263000. I think xls has a max limit of 65535 by ~277 or something. I managed to read it using xlsread on the csv file itself. Thank you
  • Gabriel McAdams
    Gabriel McAdams over 14 years
    Excel can only have commas in the field because they are setting a text qualifier (") for that field. Without one, even Excel can't figure out that it should be a single field containing commas (side note: Excel 2007 allows over a million rows)
  • Gabriel McAdams
    Gabriel McAdams over 14 years
    You can't differentiate field delimiters from commas in the text when there are commas in the fields and no text qualifiers. Are you able to talk to those who generate this CSV and get them to use a text qualifier?
  • Kena
    Kena over 14 years
    +1 I've found that MATLAB "auto-loading" features are not extremely robust when your data is not just well-behaved numbers. I've even had issues with buggy undocumented features (loading a hexadecimal numbers). It sucks, but when in doubt, better implement the parsing yourself.