Use Scala parser combinators to parse CSV files


Solution 1

What you missed is whitespace: RegexParsers skips it by default, newlines included. I also threw in a couple of bonus improvements.

import scala.util.parsing.combinator._

object CSV extends RegexParsers {
  override protected val whiteSpace = """[ \t]""".r

  def COMMA   = ","
  def DQUOTE  = "\""
  def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
  def CR      = "\r"
  def LF      = "\n"
  def CRLF    = "\r\n"
  def TXT     = "[^\",\r\n]".r

  def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ opt(CRLF)
  def record: Parser[List[String]] = rep1sep(field, COMMA)
  def field: Parser[String] = (escaped|nonescaped)
  def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
  def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }

  def parse(s: String) = parseAll(file, s) match {
    case Success(res, _) => res
    case _ => List[List[String]]()
  }
}
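A side effect of overriding whiteSpace with [ \t] is that spaces inside a field are skipped too, because RegexParsers drops a whitespace prefix before every literal or regex token, and TXT matches one character at a time. The following is a minimal sketch of that behaviour using only the standard library. SkipDemo and readField are hypothetical names, and the loop only imitates the skipping step; it does not call the parser combinator library.

```scala
import scala.util.matching.Regex

object SkipDemo {
  // Same single-character pattern as TXT in the parser above.
  val txt: Regex = """[^",\r\n]""".r

  // Imitates rep(TXT): before each token, drop a prefix matching `whiteSpace`
  // (if one is configured), then try to read one TXT character.
  def readField(input: String, whiteSpace: Option[Regex]): String = {
    val out = new StringBuilder
    var rest = input
    var done = false
    while (!done) {
      val afterSkip = whiteSpace.flatMap(_.findPrefixOf(rest)) match {
        case Some(ws) => rest.drop(ws.length)
        case None     => rest
      }
      txt.findPrefixOf(afterSkip) match {
        case Some(tok) =>
          out.append(tok)
          rest = afterSkip.drop(tok.length)
        case None =>
          done = true
      }
    }
    out.toString
  }

  def main(args: Array[String]): Unit = {
    // Skipping [ \t] before every token removes the inner space.
    println(readField("foo bar", Some("""[ \t]""".r))) // foobar
    // With no skipping, the space is read as ordinary TXT.
    println(readField("foo bar", None))                // foo bar
  }
}
```

If spaces inside fields matter, this suggests turning whitespace skipping off entirely and handling optional padding explicitly in the grammar.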

Solution 2

Since the Scala Parser Combinators library was moved out of the Scala standard library in 2.11, there is no good reason not to use the much more performant Parboiled2 library instead. Here is a version of the CSV parser in Parboiled2's DSL:

/*  based on comments in https://github.com/sirthias/parboiled2/issues/61 */
import org.parboiled2._
import scala.util.Try
case class Parboiled2CsvParser(input: ParserInput, delimeter: String) extends Parser {
  def DQUOTE = '"'
  def DELIMITER_TOKEN = rule(capture(delimeter))
  def DQUOTE2 = rule("\"\"" ~ push("\""))
  def CRLF = rule(capture("\r\n" | "\n"))
  def NON_CAPTURING_CRLF = rule("\r\n" | "\n")

  val delims = s"$delimeter\r\n" + DQUOTE
  def TXT = rule(capture(!anyOf(delims) ~ ANY))
  val WHITESPACE = CharPredicate(" \t")
  def SPACES: Rule0 = rule(oneOrMore(WHITESPACE))

  def escaped = rule(optional(SPACES) ~
    DQUOTE ~ (zeroOrMore(DELIMITER_TOKEN | TXT | CRLF | DQUOTE2) ~ DQUOTE ~
    optional(SPACES)) ~> (_.mkString("")))
  def nonEscaped = rule(zeroOrMore(TXT | capture(DQUOTE)) ~> (_.mkString("")))

  def field = rule(escaped | nonEscaped)
  def row: Rule1[Seq[String]] = rule(oneOrMore(field).separatedBy(delimeter))
  def file = rule(zeroOrMore(row).separatedBy(NON_CAPTURING_CRLF))

  def parsed() : Try[Seq[Seq[String]]] = file.run()
}

Solution 3

The default whitespace pattern for RegexParsers is \s+, which includes newlines. As a result, CR, LF and CRLF never get a chance to be matched, because the parser skips them automatically before trying each token.
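To see why the record separators disappear, it may help to check what each whitespace pattern would consume at the start of a line break. This sketch uses only scala.util.matching.Regex; WhitespaceDemo and skippedPrefix are illustrative names, not part of the parser combinator library.

```scala
import scala.util.matching.Regex

object WhitespaceDemo {
  // The default RegexParsers whitespace pattern.
  val defaultWhiteSpace: Regex = """\s+""".r
  // The narrower pattern from Solution 1: spaces and tabs only.
  val narrowWhiteSpace: Regex = """[ \t]""".r

  // Returns the leading part of `input` that the pattern would skip.
  def skippedPrefix(ws: Regex, input: String): String =
    ws.findPrefixOf(input).getOrElse("")

  def main(args: Array[String]): Unit = {
    val rest = "\r\nhello, world, 456"
    // \s+ swallows both \r and \n, so CRLF can never match afterwards.
    println(skippedPrefix(defaultWhiteSpace, rest).length) // 2
    // [ \t] leaves the line break in place for the CRLF parser.
    println(skippedPrefix(narrowWhiteSpace, rest).length)  // 0
  }
}
```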

Author: Rio

Updated on June 03, 2022

Comments

  • Rio
    Rio almost 2 years

    I'm trying to write a CSV parser using Scala parser combinators. The grammar is based on RFC4180. I came up with the following code. It almost works, but I cannot get it to correctly separate different records. What did I miss?

    object CSV extends RegexParsers {
      def COMMA   = ","
      def DQUOTE  = "\""
      def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }
      def CR      = "\r"
      def LF      = "\n"
      def CRLF    = "\r\n"
      def TXT     = "[^\",\r\n]".r
      
      def file: Parser[List[List[String]]] = ((record~((CRLF~>record)*))<~(CRLF?)) ^^ { 
        case r~rs => r::rs
      }
      def record: Parser[List[String]] = (field~((COMMA~>field)*)) ^^ {
        case f~fs => f::fs
      }
      def field: Parser[String] = escaped|nonescaped
      def escaped: Parser[String] = (DQUOTE~>((TXT|COMMA|CR|LF|DQUOTE2)*)<~DQUOTE) ^^ { case ls => ls.mkString("")}
      def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
    
      def parse(s: String) = parseAll(file, s) match {
        case Success(res, _) => res
        case _ => List[List[String]]()
      }
    }
    
    
    println(CSV.parse(""" "foo", "bar", 123""" + "\r\n" + 
      "hello, world, 456" + "\r\n" +
      """ spam, 789, egg"""))
    
    // Output: List(List(foo, bar, 123hello, world, 456spam, 789, egg)) 
    // Expected: List(List(foo, bar, 123), List(hello, world, 456), List(spam, 789, egg))
    

    Update: problem solved

    By default, RegexParsers skips whitespace (spaces, tabs, carriage returns, and line feeds) using the regular expression \s+. That is why the parser above cannot separate records: the line breaks are consumed before CRLF can match. Narrowing the whiteSpace definition to just [ \t] does not fully solve the problem, because the parser then drops all spaces within fields (so "foo bar" in the CSV becomes "foobar"), which is undesirable. Instead, we need to disable skipWhitespace altogether. The updated source of the parser is:

    import scala.util.parsing.combinator._
    
    // A CSV parser based on RFC4180
    // https://www.rfc-editor.org/rfc/rfc4180
    
    object CSV extends RegexParsers {
      override val skipWhitespace = false   // meaningful spaces in CSV
    
      def COMMA   = ","
      def DQUOTE  = "\""
      def DQUOTE2 = "\"\"" ^^ { case _ => "\"" }  // combine 2 dquotes into 1
      def CRLF    = "\r\n" | "\n"
      def TXT     = "[^\",\r\n]".r
      def SPACES  = "[ \t]+".r
    
      def file: Parser[List[List[String]]] = repsep(record, CRLF) <~ (CRLF?)
    
      def record: Parser[List[String]] = repsep(field, COMMA)
    
      def field: Parser[String] = escaped|nonescaped
    
    
      def escaped: Parser[String] = {
        ((SPACES?)~>DQUOTE~>((TXT|COMMA|CRLF|DQUOTE2)*)<~DQUOTE<~(SPACES?)) ^^ { 
          case ls => ls.mkString("")
        }
      }
    
      def nonescaped: Parser[String] = (TXT*) ^^ { case ls => ls.mkString("") }
    
    
    
      def parse(s: String) = parseAll(file, s) match {
        case Success(res, _) => res
        case e => throw new Exception(e.toString)
      }
    }
    
  • Daniel C. Sobral
    Daniel C. Sobral about 13 years
    How does that differ from protected val whiteSpace = """\s+""".r, which is RegexParsers's default? -- Ah, got it. Newline is space as well, so your override removed it from consideration.
  • Rio
    Rio about 13 years
    Thank you very much for pointing out the white space issue! Your solution correctly parses different records. However it also ignores spaces within fields. Please see my updated question to see my solution after adopting your changes.
  • djsumdog
    djsumdog over 10 years
    Change the CRLFs in file to CRLF|LF for both of them if you want to support non-windows line feeds (it's just \n in Linux)
  • harschware
    harschware over 9 years
    Since you went through the effort of writing such a nice blog about it, we might as well post the link here :-) maciejb.me/2014/07/11/…
  • Toby
    Toby about 9 years
    Shouldn't CRLF = rule(capture("\n\r" | "\n")) be CRLF = rule(capture("\r\n" | "\n"))? and again for NON_CAPTURING_CRLF?
  • Maciej Biłas
    Maciej Biłas about 9 years
    @Toby of course it should! Thank you for pointing that out, I've corrected the answer.
  • Toby
    Toby about 9 years
    Great stuff. Shouldn't it support (double) quoted values out of the box? Looks to me like it should but it doesn't parse it as I'd expect. ie, "a,b", "c"
  • Maciej Biłas
    Maciej Biłas about 9 years
    @Toby it sure should! I've fixed that one as well. :-)