Parse CSV with double quote in some cases

29,008

Solution 1

You could use Matcher.find with the following regular expression:

\s*("[^"]*"|[^,]*)\s*

Here's a more complete example:

String s = "a1, a2, a3, \"a4,a5\", a6";
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(s);
while (matcher.find()) {
    System.out.println(matcher.group(1));
}

See it working online: ideone

Solution 2

Here is some code for you, I hope using code out of here doesn't count open source, which is.

package bestsss.util;

import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class SplitCSVLine {
    public static String[] splitCSV(BufferedReader reader) throws IOException{
        return splitCSV(reader, null, ',', '"');
    }

    /**
     * 
     * @param reader - some line enabled reader, we lazy
     * @param expectedColumns - convenient int[1] to return the expected
     * @param separator - the C(omma) SV (or alternative like semi-colon) 
     * @param quote - double quote char ('"') or alternative
     * @return String[] containing the field
     * @throws IOException
     */
    public static String[] splitCSV(BufferedReader reader, int[] expectedColumns, char separator, char quote) throws IOException{       
        final List<String> tokens = new ArrayList<String>(expectedColumns==null?8:expectedColumns[0]);
        final StringBuilder sb = new StringBuilder(24);

        for(boolean quoted=false;;sb.append('\n')) {//lazy, we do not preserve the original new line, but meh
            final String line = reader.readLine();
            if (line==null)
                break;
            for (int i = 0, len= line.length(); i < len; i++) { 
                final char c = line.charAt(i);
                if (c == quote) {
                    if( quoted   && i<len-1 && line.charAt(i+1) == quote ){//2xdouble quote in quoted 
                        sb.append(c);
                        i++;//skip it
                    }else{
                        if (quoted){
                            //next symbol must be either separator or eol according to RFC 4180
                            if (i==len-1 || line.charAt(i+1) == separator){
                                quoted = false;
                                continue;
                            }
                        } else{//not quoted
                            if (sb.length()==0){//at the very start
                                quoted=true;
                                continue;
                            }
                        }
                        //if fall here, bogus, just add the quote and move on; or throw exception if you like to
                        /*
                        5.  Each field may or may not be enclosed in double quotes (however
                           some programs, such as Microsoft Excel, do not use double quotes
                           at all).  If fields are not enclosed with double quotes, then
                           double quotes may not appear inside the fields.
                      */ 
                        sb.append(c);                   
                    }
                } else if (c == separator && !quoted) {
                    tokens.add(sb.toString());
                    sb.setLength(0); 
                } else {
                    sb.append(c);
                }
            }
            if (!quoted)
                break;      
        }
        tokens.add(sb.toString());//add last
        if (expectedColumns !=null)
            expectedColumns[0] = tokens.size();
        return tokens.toArray(new String[tokens.size()]);
    }
    public static void main(String[] args) throws Throwable{
        java.io.StringReader r = new java.io.StringReader("222,\"\"\"zzzz\", abc\"\" ,   111   ,\"1\n2\n3\n\"");
        System.out.println(java.util.Arrays.toString(splitCSV(new BufferedReader(r))));
    }
}

Solution 3

I came across this same problem (but in Python), one way I found to solve it, without regexes, was: When you get the line, check for any quotes, if there are quotes, split the string on quotes, and split the even indexed results of the resulting array on commas. The odd indexed strings should be the full quoted values.

I'm no Java coder, so take this as pseudocode...

line = String[];
    if ('"' in row){
        vals = row.split('"');
        for (int i =0; i<vals.length();i+=2){
            line+=vals[i].split(',');
        }
        for (int j=1; j<vals.length();j+=2){
            line+=vals[j];
        }
    }
    else{
        line = row.split(',')
    }

Alternatively, use a regex.

Solution 4

The below code seems to work well and can handle quotes within quotes.

final static Pattern quote = Pattern.compile("^\\s*\"((?:[^\"]|(?:\"\"))*?)\"\\s*,");

public static List<String> parseCsv(String line) throws Exception
{       
    List<String> list = new ArrayList<String>();
    line += ",";

    for (int x = 0; x < line.length(); x++)
    {
        String s = line.substring(x);
        if (s.trim().startsWith("\""))
        {
            Matcher m = quote.matcher(s);
            if (!m.find())
                throw new Exception("CSV is malformed");
            list.add(m.group(1).replace("\"\"", "\""));
            x += m.end() - 1;
        }
        else
        {
            int y = s.indexOf(",");
            if (y == -1)
                throw new Exception("CSV is malformed");
            list.add(s.substring(0, y));
            x += y;
        }
    }
    return list;
}
Share:
29,008
HP.
Author by

HP.

Updated on August 15, 2022

Comments

  • HP.
    HP. over 1 year

    I have csv that comes with format:

    a1, a2, a3, "a4,a5", a6

    Only field with , will have quotes

    Using Java, how to easily parse this? I try to avoid using open source CSV parser as company policy. Thanks.