Remove comma between the quotes only in a comma delimited file

text-processing sed awk csv

70,850

Solution 1

If the quotes are balanced, you will want to remove commas between every other quote, this can be expressed in awk like this:

awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' infile

Output:

123,ABC DEV 23,345,534.202,NAME

Explanation

The -F" makes awk separate the line at the double-quote signs, which means every other field will be the inter-quote text. The for-loop runs gsub, short for globally substitute, on every other field, replacing comma (",") with nothing (""). The 1 at the end invokes the default code-block: { print $0 }.

Solution 2

There is a good response, using sed simply one time with a loop:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
  sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta'
123,"ABC  DEV 23",345,534,"some more  comma-separated  words",202,NAME

Explanation:

:a; is a label for furter branch
s/^$\([^"]*,\?\|"[^",]*",\?$*"[^",]*\),/\1 / could contain 3 enclosed parts
- first the 2nd: [^"]*,\?\|"[^",]*",\? match for a string containing no double quote, maybe followed by a coma or a string enclosed by two double quote, without coma and maybe followed by a coma.
- than the first RE part is composed by as many repetition of previously described part 2, followed by 1 double quote and some caracteres, but no double-quote, nor comas.
- The first RE part as to be followed by a coma.
- Nota, the rest of the line don't need to be touched
ta will loop to :a if previous s/ command did some change.

Once loop done, you could even add s/ */ /g:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
    sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta;s/  */ /g'

will suppress double spaces:

123,"ABC DEV 23",345,534,"some more comma-separated words",202,NAME

Solution 3

A general solution that can also handle several commas between balanced quotes needs a nested substitution. I implement a solution in perl, which process every line of a given input and only substitute commas in every other pair of quotes:

perl -pe 's/ "  (.+?  [^\\])  "               # find all non escaped 
                                              # quoting pairs
                                              # in a non-greedy way

           / ($ret = $1) =~ (s#,##g);         # remove all commas within quotes
             $ret                             # substitute the substitution :)
           /gex'

or in short

perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'

You can either pipe the text you want to process to the command or specify the textfile to be processed as last command line argument.

Solution 4

I would use a language with a proper CSV parser. For example:

ruby -r csv -ne '
  CSV.parse($_) do |row|
    newrow = CSV::Row.new [], []
    row.each {|field| newrow << field.delete(",")}
    puts newrow.to_csv
  end
' < input_file

Solution 5

Your second quotes are misplaced:

sed -e 's/\(".*\),\(.*"\)/\1 \2/g'

In addition, using regular expressions tend to match the longest possible part of the text, meaning this will not work if you have more than one quoted field in the string.

A way that handles multiple quoted fields in sed

sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

This is also a way to solve this, however, with input that may contain more than one comma per quoted field the first expression in the sed would have to be repeated as many times as the maximum comma content in a single field, or until it does not change the output at all.

Running sed with the more than one expression should be more efficient than several sed processes running and a "tr" all running with open pipes.

However, this may have undesired consequences if the input is not properly formatted. i.e. nested quotes, unterminated quotes.

Using the running example:

echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME' \
| sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' \
-e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'

Output:

123,ABC  DEV 23,345,534,some more  comma-separated  words,202,NAME

View more solutions

70,850

ekoeppen

Updated on September 18, 2022

Comments

ekoeppen over 1 year
I have a input file delimited with commas (,). There are some fields enclosed in double quotes that are having a comma in them. Here is the sample row
```
123,"ABC, DEV 23",345,534.202,NAME
```
I need to remove all the comma's occuring within inside the double quotes and the double quotes as well. So the above line should get parsed into as shown below
```
123,ABC DEV 23,345,534.202,NAME
```
I tried the following using sed but not giving expected results.
```
sed -e 's/$".*$,$".*$/\1 \2/g'
```
Any quick tricks with sed, awk or any other unix utility please?
tojrobinson over 11 years

The [^\\] is going to have the undesired effect of matching the last character inside the quotes and removing it (non \ character), i.e., you should not consume that character. Try (?<!\\) instead.
user1146332 over 11 years

Thanks for your objection, I have corrected that. Nevertheless I think we don't need look behind assertion here, or do we!?
tojrobinson over 11 years

Including the non \ in your capture group produces an equivalent result. +1
Alessio over 11 years

+1. after trying a few things with sed, I checked sed's docs and confirmed that it can't apply a replace to just the matching portion of a line...so gave up and tried perl. Ended up with a very similar approach but this version uses [^"]* to make the match non-greedy (i.e. matches everything from one " to the next "): perl -pe 's/"([^"]+)"/($match = $1) =~ (s:,::g);$match;/ge;'. It does not acknowledge the outlandish idea that a quote might be escaped with a backslash :-)
user1146332 over 11 years

Thanks for your comment. Would be interesting if either the [^"]* approach or the explicit non-greedy approach consumes less cpu time.
Thor over 11 years

@CraigSanders: with GNU sed you sort of can with the execute flag and sub-shelling, e.g.: sed -r 's/(.*)"([^"]+)"(.*)/(echo -n "\1"; echo -n "\2" | tr -d ','; echo -n "\3")/e'
Alessio over 11 years

@Thor: nice. I started playing with /e but couldn't get it to replace multiple commas inside quotes or cope with multiple quote-delimited sections on a line. also, perl is easier to read and understand when sed scripts start getting too complicated.
ekoeppen over 11 years

Please can you elaborate on gsub and explain in short, how this one liner works?? please.
Thor over 11 years

@CraigSanders: Agreed, that soon becomes too complicated. I did find a simpler two tier sed alternative which I added to my answer.
Stéphane Chazelas over 11 years

That solution doesn't work in cases like echo '"", "asd"' | perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'
Thor over 11 years

You can make it more general with conditional branching and more readable with ERE, e.g. with GNU sed: sed -r ':r; s/("[^",]+),([^",]*)/\1 \2/g; tr; s/"//g'.
CocoaEv almost 10 years

Thank you! This scripts works really well, but could you explain the lonely 1 at the end of the script? -- } 1' --
Thor almost 10 years

@CocoaEv: It executes { print $0 }. I added that to the explanation as well.
tricasse about 9 years

Works also with nested quotes. Awesome, thanks!
Olivier Dulac over 7 years

this approach has a problem: sometimes the csv has rows that span several lines, such as: prefix,"something,otherthing[newline]something , else[newline]3rdline,and,things",suffix (ie: several lines, and nested "," anywhere within a multi-line double-quoting : the whole "...." part should be rejoined and inside , should be replaced/removed ... ) : your script won't see pairs of double quotes in that case, and it's not really easy to solve (need to "rejoin" lines that are in an "open" (ie, odd-numbered) double quote...+take extra care if there is also an escaped \" inside the string)
KIC over 7 years

while I liked this solution initially, it turned out to be incredible slow for big files ...
Danton Noriega over 5 years

Loved this solution but I tweaked it given I often like to keep the commas but still want to delimit. Instead, I switched the commas outside the quotes to pipes, converting the csv to a psv file: awk -F'"' -v OFS='"' '{ for (I=1; i<=NF; i+=2) gsub(",", "|", $i) } 1' infile
Anne van Rossum almost 4 years

Tried it on the command itself. :-) echo ':a;s/^$\([^"]*,\?\|"[^",]*",\?$*"[^",]*\),/\1 /;ta' | sed ':a;s/^$\([^"]*,\?\|"[^",]*",\?$*"[^",]*\),/\1 /;ta' becomes :a;s/^$\([^"]* \?\|"[^" ]*",\?$*"[^",]*\),/\1 /;ta
avocado over 2 years

echo '1,2,"abc,de,f",10' | sed ':a;s/"$[^",]*$,$.*$"/"\1 \2"/;ta'
they over 2 years

You'd use csvformat -d '^' to convert the file back to using commas as delimiters. However, what you're doing seems to work, but only because all non-delimiting commas are removed.