How to parse a dirty CSV with Node.js?


Solution 1

The data is not too messed up to work with. There is a clear pattern.

General steps:

  1. Temporarily remove the mixed-format inner fields (those delimited by two or more double quotes and containing all kinds of characters).
  2. Remove the quotes from the start and end of quoted lines, giving clean CSV.
  3. Split the data into columns.
  4. Put the removed fields back.

Step 1 above is the most important. Once those fields are taken out, the problems with newlines, empty rows, quotes and commas disappear. If you look at the data you can see that columns 7, 8 and 9 contain mixed data, but it is always delimited by two quotes or more, e.g.

good,clean,data,here,"""<-BEGINNING OF FIELD DATA> Oh no
++\n\n<br/>whats happening,, in here, pages of chinese
characters etc END OF FIELD ->""",more,clean,data
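The capture regex used in the working example below can be tried in isolation. A minimal sketch (the sample string is made up, not taken from the actual file):

```javascript
// A made-up sample with one messy field delimited by multiple quotes.
const sample = 'good,clean,"""Oh no\n, messy\nstuff""",more';

// Lazily match from one doubled quote to the next, across newlines.
const matches = sample.match(/""[\s\S]*?""/g);

console.log(matches.length);            // one messy field found
console.log(matches[0].includes('\n')); // embedded newlines are captured too
```

Because the match is lazy and `[\s\S]` crosses line breaks, everything between one pair of doubled quotes and the next is captured as a single unit, no matter how messy.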

Here is a working example based on the file provided:

const fs = require('fs');

fs.readFile('./data_stack.csv', (e, data) => {
    if (e) throw e;

    // Take out fields that are delimited with double+ quotes
    var dirty = data.toString();
    var matches = dirty.match(/""[\s\S]*?""/g) || [];
    matches.forEach((m, i) => {
        dirty = dirty.replace(m, "<REPL-" + i + ">");
    });

    var cleanData = dirty
        .split('\n') // get lines

        // ignore first line with column names
        .filter((l, i) => i > 0)

        // remove first and last quotation mark if they exist
        // (length - 2 also drops the trailing \r from CRLF line endings)
        .map(l => l[0] === '"' ? l.substring(1, l.length - 2) : l)

        // split into columns
        .map(l => l.split(','))

        // put the replaced fields back (columns 7, 8 and 9)
        .map(col => {

            if (col.length > 9) {
                col[7] = returnField(col[7]);
                col[8] = returnField(col[8]);
                col[9] = returnField(col[9]);
            }
            return col;

            // match only the placeholder tokens, so real data
            // containing angle brackets is left alone
            function returnField(f) {
                if (f) {
                    var repls = f.match(/<REPL-\d+>/g);
                    if (repls)
                        repls.forEach(m => {
                            var num = +m.split('-')[1].split('>')[0];
                            f = f.replace(m, matches[num]);
                        });
                }
                return f;
            }
        });

    console.log(cleanData);
});

Result:

Data looks pretty clean. All rows produce the expected number of columns matching the header (last 2 rows shown):

  ...,
  [ '19403',
    '560e348d2adaffa66f72bfc9',
    'done',
    '276',
    '2015-10-02T07:38:53.172Z',
    '20151002',
    '560e31f69cd6d5059668ee16',
    '""560e336ef3214201030bf7b5""',
    'a+�a��a+�a+�a��a+�a��a+�a��',
    '',
    '560e2e362adaffa66f72bd99',
    '55f8f041b971644d7d861502',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],
  [ '20388',
    '560ce1a467cf15ab2cf03482',
    'update',
    '231',
    '2015-10-01T07:32:52.077Z',
    '20151001',
    '560ce1387494620118c1617a',
    '""""""Final test, with a comma""""""',
    '',
    '',
    '55e6dff9b45b14570417a908',
    '55e6e00fb45b14570417a92f',
    'foo',
    'foo',
    '[email protected]',
    'bar.com' ],

Solution 2

"I don't know how to make this CSV clean, neither with R nor with Node.js."

Actually, it is not as bad as it looks.

This file can easily be converted to valid CSV using the following steps:

  • replace all "" with ".
  • replace all \n" with \n.
  • replace all "\n with \n.

With \n meaning a newline, not the characters "\n" which also appear in your file.

Note that in your example file \n is actually \r\n (0x0d, 0x0a), so depending on the software you use you may need to replace \n with \r\n in the rules above. Also, in your example there is a newline after the last row, so a quote as the last character will be replaced too - you may want to check this against the original file.
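As a sketch, the three replacements can be applied with plain string operations in Node.js before handing the result to a CSV parser. Note the order: the line-break rules run before the "" collapse, matching the transform that ends up being used later in this thread. Assumes CRLF line endings, as in the sample file:

```javascript
// Apply the three cleanup replacements from the steps above.
function cleanCsv(dirty) {
    return dirty
        .replace(/\r\n"/g, '\r\n')  // quote right after a line break
        .replace(/"\r\n/g, '\r\n')  // quote right before a line break
        .replace(/""/g, '"');       // doubled quotes -> single quote
}

// Example: a row that was wrapped in quotes and had its inner quotes doubled.
const dirty = 'x\r\n"a,""b""\r\n';
console.log(cleanCsv(dirty)); // 'x\r\na,"b"\r\n'
```

The wrapping quotes around the row disappear with the line-break rules, and the doubled inner quotes collapse back to normal CSV escaping.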

This should produce a valid CSV file.

There will still be multiline fields, but that was probably intended. But now those are properly quoted and any decent csv parser should be able to handle multiline fields.
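Why multiline fields survive once the quoting is fixed: a conforming parser only treats a newline as a record separator when it occurs outside quotes. A stripped-down sketch of that rule (not a full CSV parser - no CRLF normalization or validation):

```javascript
// Split CSV text into records, ignoring newlines inside quoted fields.
function splitRecords(csv) {
    const records = [];
    let current = '';
    let inQuotes = false;
    for (const ch of csv) {
        if (ch === '"') inQuotes = !inQuotes; // "" toggles twice, net no change
        if (ch === '\n' && !inQuotes) {
            records.push(current);
            current = '';
        } else {
            current += ch;
        }
    }
    if (current) records.push(current);
    return records;
}

console.log(splitRecords('a,"x\ny",b\nc,d'));
// [ 'a,"x\ny",b', 'c,d' ] - the newline inside quotes stays in the field
```

This is exactly why the stray quotes in the original file were fatal: every unbalanced quote flips the parser's in/out-of-quotes state and makes it mis-read the following newlines.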


It looks like the original data has had an extra pass for escaping quote characters:

  • If the original fields contained a , they were quoted, and if those fields already contained quotes, the quotes were escaped with another quote - which is the correct way to do it.

  • But then all rows containing a quote seem to have been quoted again (effectively converting each such row into one quoted field), and all the quotes inside that row were escaped with another quote.

  • Obviously, something went wrong with the multiline fields: quotes were added between the multiple lines too, which is not correct.
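The suspected double-escaping can be reproduced by applying standard CSV quoting twice - a hypothetical reconstruction of what the exporting tool may have done, not the actual tool itself:

```javascript
// Standard CSV quoting: double the inner quotes, wrap the field in quotes.
function quoteField(f) {
    return '"' + f.replace(/"/g, '""') + '"';
}

const original = 'Final test, with a comma';
const pass1 = quoteField(original);  // correct single pass
const pass2 = quoteField(pass1);     // the suspected extra pass

console.log(pass1); // "Final test, with a comma"
console.log(pass2); // """Final test, with a comma"""
```

Each extra pass multiplies the quotes, which matches the runs of quotes visible in the parsed output above.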

Author: Synleb

Updated on June 19, 2022
Comments

  • Synleb
    Synleb about 2 years

    I'm scratching my head on a CSV file I cannot parse correctly, due to many errors. I extracted a sample you can download here: Test CSV File

    Main errors (or what generated an error) are:

    • Quotes & commas (many errors when trying to parse the file with R)
    • Empty rows
    • Unexpected line break inside a field

    I first decided to use regular expressions line by line to clean the data before loading it into R, but that didn't solve the problem and it was too slow (200 MB file).

    So I decided to use a CSV parser under Node.js with the following code:

    'use strict';
    
    const Fs  = require('fs');
    const Csv = require('csv');
    
    let input       = 'data_stack.csv';
    let readStream  = Fs.createReadStream(input);
    let option      = {delimiter: ',', quote: '"', escape: '"', relax: true};
    
    let parser = Csv.parse(option).on('data', (data) => {
        console.log(data)
    });
    
    readStream.pipe(parser)
    

    But:

    • Some rows are parsed correctly (array of strings)
    • Some are not parsed (all fields are one string)
    • Some rows are still empty (this can be solved by adding skip_empty_lines: true to the options)
    • I don't know how to handle the unexpected line break.

    I don't know how to make this CSV clean, neither with R nor with Node.js.

    Any help?

    EDIT:

    Following @Danny_ds's solution, I can parse it correctly. But now I cannot stringify it back correctly.

    With console.log() I get a proper object, but when I try to stringify it I don't get clean CSV (there are still line breaks and empty rows).

    Here is the code I'm using:

    'use strict';
    
    const Fs  = require('fs');
    const Csv = require('csv');
    
    
    let input  = 'data_stack.csv';
    let output = 'data_output.csv';
    
    let readStream  = Fs.createReadStream(input);
    let writeStream = Fs.createWriteStream(output);
    
    let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};
    
    
    let transformer = Csv.transform(data => {
        let dirty = data.toString();
        let replace = dirty.replace(/\r\n"/g, '\r\n').replace(/"\r\n/g, '\r\n').replace(/""/g, '"');
    
        return replace;
    });
    
    let parser = Csv.parse(opt);
    let stringifier = Csv.stringify();
    
    readStream.pipe(transformer).pipe(parser).pipe(stringifier).pipe(writeStream);
    

    EDIT 2:

    Here is the final code that works:

    'use strict';
    
    const Fs  = require('fs');
    const Csv = require('csv');
    
    
    let input  = 'data_stack.csv';
    let output = 'data_output.csv';
    
    let readStream  = Fs.createReadStream(input);
    let writeStream = Fs.createWriteStream(output);
    
    let opt  = {delimiter: ',', quote: '"', escape: '"', relax: true, skip_empty_lines: true};
    
    
    let transformer = Csv.transform(data => {
        let dirty = data.toString();
        let replace = dirty
            .replace(/\r\n"/g, '\r\n')
            .replace(/"\r\n/g, '\r\n')
            .replace(/""/g, '"');
    
        return replace;
    });
    
    let parser = Csv.parse(opt);
    
    let cleaner = Csv.transform(data => {
        let clean = data.map(l => {
            if (l.length > 100 || l[0] === '+') {
                return l = "Encoding issue";
            }
            return l;
        });
        return clean;
    });
    
    let stringifier = Csv.stringify();
    
    readStream.pipe(transformer).pipe(parser).pipe(cleaner).pipe(stringifier).pipe(writeStream);
    

    Thanks to everyone!

  • Synleb
    Synleb over 8 years
    Thanks Julian. Just one thing though regarding your first point (and the second, by the way): how can I count the commas without also counting the commas enclosed inside quoted strings? And by applying a regex to remove the double quotes, I leave those enclosed commas in the wild.
  • Julian Knight
    Julian Knight over 8 years
    That's why I asked whether the data could contain commas. If it can, I'm not sure you can fix the data without manually checking it, and possibly not even then. Not all CSV data contains embedded commas, which is why the quotes around the data are actually optional. Though in your case you have many mismatched quotes, which is a concern, as it either indicates corrupted data or that the data itself is actually binary, some of which is showing up as quotes.
  • Julian Knight
    Julian Knight over 8 years
    I should have also said that, without knowing the origin of the data, it is almost impossible to provide a definitive answer.
  • Synleb
    Synleb over 8 years
    Just tried with Fs.readFile('data_stack.csv', (err, data) => { data.toString().replace(/""/g, '"').replace(/[\r\n]"/g, '\n').replace(/"[\r\n]/g, '\n'); Fs.writeFile('data_output.csv', data); }) and it doesn't work.
  • Danny_ds
    Danny_ds over 8 years
    You need something like: .replace(/\r\n"/g, '\r\n') or .replace(/\n"/g, '\n'). Idem for the last replace.
  • Danny_ds
    Danny_ds over 8 years
    @Synleb - Well, it's normal that there are still newlines and empty lines (should not be empty csv-rows), because there is a multiline field in your data (column 8 / R2) - which is valid csv if the multiline field is quoted, which should be the case after the first cleanup. If you don't want that, you could remove the newlines in that field only, after parsing the file.
  • Danny_ds
    Danny_ds over 8 years
    @Synleb - But you'll have to make sure your csv parser supports multiline fields of course (I didn't see an option for that on the site you link to). But since you're saying that it is parsed correctly, I guess that's the case.
  • Synleb
    Synleb over 8 years
    You're right, the field is still on multiline but the Csv is now valid when parsed. I just removed the malformed field and it's clean. Thanks a lot!
  • Synleb
    Synleb over 8 years
    Btw, how did you know about CR/LF in my csv? That \n was actually \r\n in my file?
  • Danny_ds
    Danny_ds over 8 years
    @Synleb - No problem, glad to help. About CR/LF: I looked at your data with a hex editor.