Remove comma between the quotes only in a comma delimited file
Solution 1
If the quotes are balanced, you will want to remove commas between every other quote, this can be expressed in awk
like this:
awk -F'"' -v OFS='' '{ for (i=2; i<=NF; i+=2) gsub(",", "", $i) } 1' infile
Output:
123,ABC DEV 23,345,534.202,NAME
Explanation
The -F"
makes awk separate the line at the double-quote signs, which means every other field will be the inter-quote text. The for-loop runs gsub
, short for globally substitute, on every other field, replacing comma (","
) with nothing (""
). The 1
at the end invokes the default code-block: { print $0 }
.
Solution 2
There is a good response, using sed simply one time with a loop:
echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta'
123,"ABC DEV 23",345,534,"some more comma-separated words",202,NAME
Explanation:
:a;
is a label for furter branchs/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /
could contain 3 enclosed parts- first the 2nd:
[^"]*,\?\|"[^",]*",\?
match for a string containing no double quote, maybe followed by a coma or a string enclosed by two double quote, without coma and maybe followed by a coma. - than the first RE part is composed by as many repetition of previously described part 2, followed by 1 double quote and some caracteres, but no double-quote, nor comas.
- The first RE part as to be followed by a coma.
- Nota, the rest of the line don't need to be touched
- first the 2nd:
ta
will loop to:a
if previouss/
command did some change.
Once loop done, you could even add s/ */ /g
:
echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME'|
sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta;s/ */ /g'
will suppress double spaces:
123,"ABC DEV 23",345,534,"some more comma-separated words",202,NAME
Solution 3
A general solution that can also handle several commas between balanced quotes needs a nested substitution. I implement a solution in perl, which process every line of a given input and only substitute commas in every other pair of quotes:
perl -pe 's/ " (.+? [^\\]) " # find all non escaped
# quoting pairs
# in a non-greedy way
/ ($ret = $1) =~ (s#,##g); # remove all commas within quotes
$ret # substitute the substitution :)
/gex'
or in short
perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'
You can either pipe the text you want to process to the command or specify the textfile to be processed as last command line argument.
Solution 4
I would use a language with a proper CSV parser. For example:
ruby -r csv -ne '
CSV.parse($_) do |row|
newrow = CSV::Row.new [], []
row.each {|field| newrow << field.delete(",")}
puts newrow.to_csv
end
' < input_file
Solution 5
Your second quotes are misplaced:
sed -e 's/\(".*\),\(.*"\)/\1 \2/g'
In addition, using regular expressions tend to match the longest possible part of the text, meaning this will not work if you have more than one quoted field in the string.
A way that handles multiple quoted fields in sed
sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'
This is also a way to solve this, however, with input that may contain more than one comma per quoted field the first expression in the sed would have to be repeated as many times as the maximum comma content in a single field, or until it does not change the output at all.
Running sed with the more than one expression should be more efficient than several sed processes running and a "tr" all running with open pipes.
However, this may have undesired consequences if the input is not properly formatted. i.e. nested quotes, unterminated quotes.
Using the running example:
echo '123,"ABC, DEV 23",345,534,"some more, comma-separated, words",202,NAME' \
| sed -e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' \
-e 's/\(\"[^",]\+\),\([^",]*\)/\1 \2/g' -e 's/\"//g'
Output:
123,ABC DEV 23,345,534,some more comma-separated words,202,NAME
Related videos on Youtube
ekoeppen
Updated on September 18, 2022Comments
-
ekoeppen over 1 year
I have a input file delimited with commas (
,
). There are some fields enclosed in double quotes that are having a comma in them. Here is the sample row123,"ABC, DEV 23",345,534.202,NAME
I need to remove all the comma's occuring within inside the double quotes and the double quotes as well. So the above line should get parsed into as shown below
123,ABC DEV 23,345,534.202,NAME
I tried the following using
sed
but not giving expected results.sed -e 's/\(".*\),\(".*\)/\1 \2/g'
Any quick tricks with
sed
,awk
or any other unix utility please? -
tojrobinson over 11 yearsThe
[^\\]
is going to have the undesired effect of matching the last character inside the quotes and removing it (non \ character), i.e., you should not consume that character. Try(?<!\\)
instead. -
user1146332 over 11 yearsThanks for your objection, I have corrected that. Nevertheless I think we don't need look behind assertion here, or do we!?
-
tojrobinson over 11 yearsIncluding the non \ in your capture group produces an equivalent result. +1
-
Alessio over 11 years+1. after trying a few things with sed, I checked sed's docs and confirmed that it can't apply a replace to just the matching portion of a line...so gave up and tried perl. Ended up with a very similar approach but this version uses
[^"]*
to make the match non-greedy (i.e. matches everything from one"
to the next"
):perl -pe 's/"([^"]+)"/($match = $1) =~ (s:,::g);$match;/ge;'
. It does not acknowledge the outlandish idea that a quote might be escaped with a backslash :-) -
user1146332 over 11 yearsThanks for your comment. Would be interesting if either the
[^"]*
approach or the explicit non-greedy approach consumes less cpu time. -
Thor over 11 years@CraigSanders: with GNU sed you sort of can with the execute flag and sub-shelling, e.g.:
sed -r 's/(.*)"([^"]+)"(.*)/(echo -n "\1"; echo -n "\2" | tr -d ','; echo -n "\3")/e'
-
Alessio over 11 years@Thor: nice. I started playing with /e but couldn't get it to replace multiple commas inside quotes or cope with multiple quote-delimited sections on a line. also, perl is easier to read and understand when sed scripts start getting too complicated.
-
ekoeppen over 11 yearsPlease can you elaborate on
gsub
and explain in short, how this one liner works?? please. -
Thor over 11 years@CraigSanders: Agreed, that soon becomes too complicated. I did find a simpler two tier sed alternative which I added to my answer.
-
Stéphane Chazelas over 11 yearsThat solution doesn't work in cases like
echo '"", "asd"' | perl -pe 's/"(.+?[^\\])"/($ret = $1) =~ (s#,##g); $ret/ge'
-
Thor over 11 yearsYou can make it more general with conditional branching and more readable with ERE, e.g. with GNU sed:
sed -r ':r; s/("[^",]+),([^",]*)/\1 \2/g; tr; s/"//g'
. -
CocoaEv almost 10 yearsThank you! This scripts works really well, but could you explain the lonely 1 at the end of the script? -- } 1' --
-
Thor almost 10 years@CocoaEv: It executes
{ print $0 }
. I added that to the explanation as well. -
tricasse about 9 yearsWorks also with nested quotes. Awesome, thanks!
-
Olivier Dulac over 7 yearsthis approach has a problem: sometimes the csv has rows that span several lines, such as:
prefix,"something,otherthing[newline]something , else[newline]3rdline,and,things",suffix
(ie: several lines, and nested "," anywhere within a multi-line double-quoting : the whole"...."
part should be rejoined and inside,
should be replaced/removed ... ) : your script won't see pairs of double quotes in that case, and it's not really easy to solve (need to "rejoin" lines that are in an "open" (ie, odd-numbered) double quote...+take extra care if there is also an escaped\"
inside the string) -
KIC over 7 yearswhile I liked this solution initially, it turned out to be incredible slow for big files ...
-
Danton Noriega over 5 yearsLoved this solution but I tweaked it given I often like to keep the commas but still want to delimit. Instead, I switched the commas outside the quotes to pipes, converting the csv to a psv file:
awk -F'"' -v OFS='"' '{ for (I=1; i<=NF; i+=2) gsub(",", "|", $i) } 1' infile
-
Anne van Rossum almost 4 yearsTried it on the command itself. :-)
echo ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta' | sed ':a;s/^\(\([^"]*,\?\|"[^",]*",\?\)*"[^",]*\),/\1 /;ta'
becomes:a;s/^\(\([^"]* \?\|"[^" ]*",\?\)*"[^",]*\),/\1 /;ta
-
avocado over 2 years
echo '1,2,"abc,de,f",10' | sed ':a;s/"\([^",]*\),\(.*\)"/"\1 \2"/;ta'
-
they over 2 yearsYou'd use
csvformat -d '^'
to convert the file back to using commas as delimiters. However, what you're doing seems to work, but only because all non-delimiting commas are removed.