String parsing in Haskell

Solution 1

Since Strings are simply lists of Chars in Haskell, Data.List would be a good place to start looking (in the interest of learning Haskell).
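
As a rough sketch of what a Data.List-flavoured approach could look like (the names unquote, commaWords and quotedNames are made up for illustration, and the code assumes the quoted words contain no spaces or commas):

```haskell
import Data.List (dropWhileEnd)

-- Strip double quotes from both ends of a word.
unquote :: String -> String
unquote = dropWhileEnd (== '"') . dropWhile (== '"')

-- Turn commas into spaces, then let `words` do the splitting.
-- Assumption: the quoted words themselves contain no spaces or commas.
commaWords :: String -> [String]
commaWords = words . map commaToSpace
  where
    commaToSpace ',' = ' '
    commaToSpace c   = c

-- Hypothetical helper combining the two steps above.
quotedNames :: String -> [String]
quotedNames = map unquote . commaWords
```

In GHCi, quotedNames "\"MARY\",\"PATRICIA\"" should evaluate to ["MARY","PATRICIA"]. If a word may contain a space or a comma, this shortcut breaks down and a real parser is the safer choice.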

For more complex cases (where commas may be nested inside quotes and should be ignored, for example), parsec (as Daniel mentioned) would be a better solution.

Also, if you're looking to parse CSVs you may try Text.CSV, though I've not tried it, so I can't say how helpful it'll be.

Solution 2

I finally decided to roll my own parsing functions since this is such a simple situation. I have learned a lot about Haskell since I first posted this question and want to document my solution here:

import Data.List (sort)
import System.IO (IOMode (ReadMode), hClose, hGetContents, openFile)

-- Split a string on a delimiter character, dropping the delimiters.
split :: Char -> String -> [String]
split _ "" = []
split c s = firstWord : split c rest
    where firstWord = takeWhile (/=c) s
          rest = drop (length firstWord + 1) s

-- Remove every occurrence of a character from a string.
removeChar :: Char -> String -> String
removeChar _ [] = []
removeChar ch (c:cs)
    | c == ch   = removeChar ch cs
    | otherwise = c : removeChar ch cs

main :: IO ()
main = do
    handle <- openFile "input/names.txt" ReadMode
    contents <- hGetContents handle
    let names = sort (map (removeChar '"') (split ',' contents))
    print names
    hClose handle

Solution 3

The most powerful solution is a parser combinator library. Haskell has several of these, but the ones that first come to mind are:

  • parsec: a very good general-purpose parsing library
  • attoparsec: a faster parsec-style library, which sacrifices the quality of error messages and some other features for extra speed
  • uu-parsinglib: a very powerful parsing library

The big advantage of parser combinators is that it is very easy to define parsers using do notation (or Applicative style, if you prefer).
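
To give a feel for both styles, here is a minimal parsec sketch for a comma-separated list of quoted words. The parser names (quotedWord, quotedWord', quotedWords, parseQuotedWords) are invented for this example, and it assumes the words contain no escaped quotes:

```haskell
import Text.Parsec (ParseError, char, many, noneOf, parse, sepBy)
import Text.Parsec.String (Parser)

-- One word wrapped in double quotes, written with do notation.
-- Commas inside the quotes are kept as part of the word.
quotedWord :: Parser String
quotedWord = do
  _ <- char '"'
  w <- many (noneOf "\"")
  _ <- char '"'
  return w

-- The same parser in Applicative style.
quotedWord' :: Parser String
quotedWord' = char '"' *> many (noneOf "\"") <* char '"'

-- Zero or more quoted words separated by commas.
quotedWords :: Parser [String]
quotedWords = quotedWord `sepBy` char ','

-- Run the parser; "(input)" is only a label used in error messages.
parseQuotedWords :: String -> Either ParseError [String]
parseQuotedWords = parse quotedWords "(input)"
```

For the input "a,b","c" (two quoted words, the first containing a comma), parseQuotedWords should return Right ["a,b","c"]. This sketch does not skip whitespace around the commas and does not check for end of input; both would be reasonable additions.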

If you just want some quick and simple string manipulation, consult the text library (for high-performance packed Unicode text) or Data.List (for ordinary list-based Strings); both provide the functions you need to manipulate strings.

Solution 4

Here's a particularly cheeky way to proceed:

parseCommaSepQuotedWords :: String -> [String]
parseCommaSepQuotedWords s = read ("[" ++ s ++ "]")

This might work, but it's very fragile and rather silly. Essentially, you are relying on the fact that Haskell's syntax for lists of strings almost coincides with your input format, so the built-in Read instance is almost the thing you want. You could use reads for better error reporting, but in reality you probably want to do something else entirely.
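
For what it's worth, the reads variant might look something like the sketch below. The name parseCommaSepQuotedWordsSafe is made up, and returning Maybe is just one possible way to report failure:

```haskell
import Data.Char (isSpace)

-- Like the `read` version, but returns Nothing instead of crashing
-- when the input is not a well-formed list of string literals.
parseCommaSepQuotedWordsSafe :: String -> Maybe [String]
parseCommaSepQuotedWordsSafe s =
  case reads ("[" ++ s ++ "]") of
    -- Accept only a single parse with nothing but whitespace left over.
    [(ws, rest)] | all isSpace rest -> Just ws
    _ -> Nothing
```

With this, "\"MARY\",\"PATRICIA\"" gives Just ["MARY","PATRICIA"], while unquoted input such as "MARY" gives Nothing rather than a runtime exception. It is still the same trick, so it inherits Haskell's string-literal rules (escapes and so on).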

In general, parsec is really worth taking a look at - it's a joy to use, and one of the things that originally really got me excited about Haskell. But if you want a homegrown solution, I often write simple things using case expressions on the result of span and break. Suppose you are looking for the next semicolon in the input. Then break (== ';') inp will return (before, after), where:

  • before is the content of inp up to (and not including) the first semicolon (or all of it if there is none)
  • after is the rest of the string:
    • if after is not empty, the first element is a semicolon
    • regardless of what else happens, before ++ after == inp
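
These properties of break can be written down as a small check. The function name breakProps is made up for illustration:

```haskell
-- True when break (== ';') behaves as described for the given input.
breakProps :: String -> Bool
breakProps inp =
  let (before, after) = break (== ';') inp
   in ';' `notElem` before                   -- before stops at the first semicolon
        && (null after || take 1 after == ";") -- after is empty or starts with one
        && before ++ after == inp            -- nothing is lost or reordered
```

For example, break (== ';') "x=1;y=2" yields ("x=1", ";y=2"), and break (== ';') "abc" yields ("abc", ""). breakProps should hold for any input string.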

So to parse a list of statements separated by semicolons, I might do this:

parseStmts :: String -> Maybe [Stmt]
parseStmts inp = case break (== ';') inp of
  (before, _ : after) -> -- ...
    -- ^ before is the first statement
    --     ^ ignore the semicolon
    --           ^ after is the rest of the string
  (_, []) -> -- inp doesn't contain any semicolons
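
One way the elided branches could be filled in is sketched below. Stmt and parseStmt here are placeholders (a statement is just its trimmed text, and a blank statement counts as an error); a real Stmt type and statement parser would take their place:

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Hypothetical statement type: just the trimmed source text.
newtype Stmt = Stmt String deriving (Eq, Show)

-- Hypothetical single-statement parser: fails on blank input.
parseStmt :: String -> Maybe Stmt
parseStmt s = case trim s of
  "" -> Nothing
  t  -> Just (Stmt t)
  where
    trim = dropWhileEnd isSpace . dropWhile isSpace

parseStmts :: String -> Maybe [Stmt]
parseStmts inp = case break (== ';') inp of
  -- A semicolon was found: parse what precedes it, then recurse on the rest.
  (before, _ : after) -> (:) <$> parseStmt before <*> parseStmts after
  -- No semicolon left: either nothing remains or this is the last statement.
  (lastStmt, [])
    | all isSpace lastStmt -> Just []
    | otherwise            -> (: []) <$> parseStmt lastStmt
```

Under these assumptions, parseStmts "a = 1; b = 2" gives Just [Stmt "a = 1",Stmt "b = 2"], a trailing semicolon is tolerated, and "a;;b" gives Nothing because of the blank statement in the middle.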

Solution 5

In the interest of having a complete answer for those who happen upon this question, Data.Text has some good functions as well.
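
As a small illustration, the quoted-names problem might be handled with Data.Text roughly like this (textNames is a made-up name, and the sketch assumes no commas inside the quotes):

```haskell
import qualified Data.Text as T

-- Split on commas, trim surrounding whitespace, then drop the quotes
-- around each piece.
-- Assumption: the quoted words themselves contain no commas.
textNames :: T.Text -> [T.Text]
textNames = map (T.dropAround (== '"') . T.strip) . T.split (== ',')
```

For instance, textNames (T.pack "\"MARY\", \"PATRICIA\"") should produce the Text values MARY and PATRICIA. Note that T.split returns a single empty piece for empty input, so an empty string yields a one-element list rather than an empty one.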

Author by

Code-Apprentice

Updated on June 04, 2022

Comments

  • Code-Apprentice
    Code-Apprentice almost 2 years

    I am very new to Haskell and am currently trying to solve a problem that requires some string parsing. My input String contains a comma-delimited list of words in quotes. I want to parse this single string into a list of the words as Strings. Where should I start learning about parsing such a String? Is there a particular module and/or set of functions that will be helpful?

    p.s. Please don't post a full solution. I am just asking for a pointer to a starting place so I can learn how to do it.

  • Ben Millwood
    Ben Millwood almost 12 years
    When I was a noob I could not make heads nor tails of uu-parsinglib. I haven't tried it since then, but I wouldn't exactly call it friendly.
  • Richard Careaga
    Richard Careaga over 8 years
    This link is now at therning.org/magnus/posts/…; see wiki.haskell.org/Parsec Sec. 5.2 for other links in the series and additional resources