Remove comma between the quotes only in a comma delimited file

70,850

Solution 1

If the quotes are balanced, you will want to remove commas between every other quote, this can be expressed in awk like this:

awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' infile

Output:

123,ABC DEV 23,345,534.202,NAME

Explanation

The -F" makes awk separate the line at the double-quote signs, which means every other field will be the inter-quote text. The for-loop runs gsub, short for globally substitute, on every other field, replacing comma (",") with nothing (""). The 1 at the end invokes the default code-block: { print $0 }.

Solution 2

There is a good response, using sed simply one time with a loop:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
  sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta'
123,"ABC  DEV 23",345,534,"some more  comma-separated  words",202,NAME

Explanation:

  • :a; is a label for furter branch
  • s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 / could contain 3 enclosed parts
    • first the 2nd: [^"]*,\?\|"[^",]*",\? match for a string containing no double quote, maybe followed by a coma or a string enclosed by two double quote, without coma and maybe followed by a coma.
    • than the first RE part is composed by as many repetition of previously described part 2, followed by 1 double quote and some caracteres, but no double-quote, nor comas.
    • The first RE part as to be followed by a coma.
    • Nota, the rest of the line don't need to be touched
  • ta will loop to :a if previous s/ command did some change.

Once loop done, you could even add s/ */ /g:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
    sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta;s/  */ /g'

will suppress double spaces:

123,"ABC DEV 23",345,534,"some more comma-separated words",202,NAME

Solution 3

A general solution that can also handle several commas between balanced quotes needs a nested substitution. I implement a solution in perl, which process every line of a given input and only substitute commas in every other pair of quotes:

perl -pe 's/ "  (.+?  [^\\])  "               # find all non escaped 
                                              # quoting pairs
                                              # in a non-greedy way

           / ($ret = $1) =~ (s#,##g);         # remove all commas within quotes
             $ret                             # substitute the substitution :)
           /gex'

or in short

perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'

You can either pipe the text you want to process to the command or specify the textfile to be processed as last command line argument.

Solution 4

I would use a language with a proper CSV parser. For example:

ruby -r csv -ne '
  CSV.parse($_) do |row|
    newrow = CSV::Row.new [], []
    row.each {|field| newrow << field.delete(",")}
    puts newrow.to_csv
  end
' < input_file

Solution 5

Your second quotes are misplaced:

sed -e 's/\(".*\),\(.*"\)/\1 \2/g'

In addition, using regular expressions tend to match the longest possible part of the text, meaning this will not work if you have more than one quoted field in the string.

A way that handles multiple quoted fields in sed

sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

This is also a way to solve this, however, with input that may contain more than one comma per quoted field the first expression in the sed would have to be repeated as many times as the maximum comma content in a single field, or until it does not change the output at all.

Running sed with the more than one expression should be more efficient than several sed processes running and a "tr" all running with open pipes.

However, this may have undesired consequences if the input is not properly formatted. i.e. nested quotes, unterminated quotes.

Using the running example:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME' \
| sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' \
-e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

Output:

123,ABC  DEV 23,345,534,some more  comma-separated  words,202,NAME
Share:
70,850

Related videos on Youtube

ekoeppen
Author by

ekoeppen

Updated on September 18, 2022

Comments

  • ekoeppen
    ekoeppen over 1 year

    I have a input file delimited with commas (,). There are some fields enclosed in double quotes that are having a comma in them. Here is the sample row

    123,"ABC, DEV 23",345,534.202,NAME
    

    I need to remove all the comma's occuring within inside the double quotes and the double quotes as well. So the above line should get parsed into as shown below

    123,ABC DEV 23,345,534.202,NAME
    

    I tried the following using sed but not giving expected results.

    sed -e 's/\(".*\),\(".*\)/\1 \2/g'
    

    Any quick tricks with sed, awk or any other unix utility please?

  • tojrobinson
    tojrobinson over 11 years
    The [^\\] is going to have the undesired effect of matching the last character inside the quotes and removing it (non \ character), i.e., you should not consume that character. Try (?<!\\) instead.
  • user1146332
    user1146332 over 11 years
    Thanks for your objection, I have corrected that. Nevertheless I think we don't need look behind assertion here, or do we!?
  • tojrobinson
    tojrobinson over 11 years
    Including the non \ in your capture group produces an equivalent result. +1
  • Alessio
    Alessio over 11 years
    +1. after trying a few things with sed, I checked sed's docs and confirmed that it can't apply a replace to just the matching portion of a line...so gave up and tried perl. Ended up with a very similar approach but this version uses [^"]* to make the match non-greedy (i.e. matches everything from one " to the next "): perl -pe 's/"([^"]+)"/($match = $1) =~ (s:,::g);$match;/ge;'. It does not acknowledge the outlandish idea that a quote might be escaped with a backslash :-)
  • user1146332
    user1146332 over 11 years
    Thanks for your comment. Would be interesting if either the [^"]* approach or the explicit non-greedy approach consumes less cpu time.
  • Thor
    Thor over 11 years
    @CraigSanders: with GNU sed you sort of can with the execute flag and sub-shelling, e.g.: sed -r 's/(.*)"([^"]+)"(.*)/(echo -n "\1"; echo -n "\2" | tr -d ','; echo -n "\3")/e'
  • Alessio
    Alessio over 11 years
    @Thor: nice. I started playing with /e but couldn't get it to replace multiple commas inside quotes or cope with multiple quote-delimited sections on a line. also, perl is easier to read and understand when sed scripts start getting too complicated.
  • ekoeppen
    ekoeppen over 11 years
    Please can you elaborate on gsub and explain in short, how this one liner works?? please.
  • Thor
    Thor over 11 years
    @CraigSanders: Agreed, that soon becomes too complicated. I did find a simpler two tier sed alternative which I added to my answer.
  • Stéphane Chazelas
    Stéphane Chazelas over 11 years
    That solution doesn't work in cases like echo '"", "asd"' | perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'
  • Thor
    Thor over 11 years
    You can make it more general with conditional branching and more readable with ERE, e.g. with GNU sed: sed -r ':r; s/("[^",]+),([^",]*)/\1 \2/g; tr; s/"//g'.
  • CocoaEv
    CocoaEv almost 10 years
    Thank you! This scripts works really well, but could you explain the lonely 1 at the end of the script? -- } 1' --
  • Thor
    Thor almost 10 years
    @CocoaEv: It executes { print $0 }. I added that to the explanation as well.
  • tricasse
    tricasse about 9 years
    Works also with nested quotes. Awesome, thanks!
  • Olivier Dulac
    Olivier Dulac over 7 years
    this approach has a problem: sometimes the csv has rows that span several lines, such as: prefix,"something,otherthing[newline]something , else[newline]3rdline,and,things",suffix (ie: several lines, and nested "," anywhere within a multi-line double-quoting : the whole "...." part should be rejoined and inside , should be replaced/removed ... ) : your script won't see pairs of double quotes in that case, and it's not really easy to solve (need to "rejoin" lines that are in an "open" (ie, odd-numbered) double quote...+take extra care if there is also an escaped \" inside the string)
  • KIC
    KIC over 7 years
    while I liked this solution initially, it turned out to be incredible slow for big files ...
  • Danton Noriega
    Danton Noriega over 5 years
    Loved this solution but I tweaked it given I often like to keep the commas but still want to delimit. Instead, I switched the commas outside the quotes to pipes, converting the csv to a psv file: awk -F'"' -v OFS='"' '{ for (I=1; i<=NF; i+=2) gsub(",", "|", $i) } 1' infile
  • Anne van Rossum
    Anne van Rossum almost 4 years
    Tried it on the command itself. :-) echo ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta' | sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta' becomes :a;s/^\(\([^"]* \?\|"[^" ]*",\?\)*"[^",]*\),/\1 /;ta
  • avocado
    avocado over 2 years
    echo '1,2,"abc,de,f",10' | sed ':a;s/"\([^",]*\),\(.*\)"/"\1 \2"/;ta'
  • they
    they over 2 years
    You'd use csvformat -d '^' to convert the file back to using commas as delimiters. However, what you're doing seems to work, but only because all non-delimiting commas are removed.