Replace every comma not enclosed in a pair of double quotes with '|'

10,388

Solution 1

First sample: Quick and dirty:

If your comma are ALWAY followed by a space in text strings, and NEVER in field separation, you could use:

sed -e 's/,\([^ ]\)/\|\1/g'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

but you have to be sure about next character.

More elaborated sample, whithout need of space, closest to your original idea.

sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'

echo '"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M' |
  sed -e ':a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta'
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

echo '"Chang,Yao-Jen",33,MIS,"Taiwan,Taipei",M' |
  sed -e '1 { :a;s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/;ta }'
"Chang,Yao-Jen"|33|MIS|"Taiwan,Taipei"|M

Explained:

sed -e '
    :a
    s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/
    ta
'
  • :a is a address location for branch (loop)
  • s/ search from begin of line for '[^",]*,' or '"...",' than replace comma by vbar.
  • ta branch to a if previous s/ have been matched.

As you requested to operate on line 2, you will have to:

sed -e '2 { :a; s/^\(\("[^"]*"\|[^",]*\)*\),/\1|/; ta } '

Edited: [WRONG! See Edit 3]

Another sample if you wanna have mixed quotes and double-quotes:

There is a sample with mixed quoted, non-quoted, and one field containing a quote, but double quoted:

cat <<eof >sample
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
A,B,'C,D',E,'F,G',H,'I,J,K'
'Chang, Yao-Jen',33,MIS,'Taiwan, Taipei',M
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M,'Chang,Yao-Jen',34,MZZ,'Taiwan, Taipei',Z
"Chang's son: Yao-Lu",55,MAA,'Taiwan, too',z
eof

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta' sample
A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F,G'|H|'I,J,K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z

where sed script could by confined in a little more readable script file as:

cat <<oesedscript >csvtopsv.sed 
#!/bin/sed -f 
# Coma Separated Values to Pipe Separated Values
:a
s/^\(\(\(['"]\)[^\3]*\3\|[^",']*\)*\),/\1|/;
ta
oesedscript
chmod +x csvtopsv.sed

./csvtopsv.sed sample
A|B|"C,D"|E|"F|G"|H|"I|J|K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
A|B|'C,D'|E|'F|G'|H|'I|J|K'
'Chang, Yao-Jen'|33|MIS|'Taiwan, Taipei'|M
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M|'Chang,Yao-Jen'|34|MZZ|'Taiwan, Taipei'|Z
"Chang's son: Yao-Lu"|55|MAA|'Taiwan, too'|z

Explained:

the s/ search for quote or double quote ['"] as third enclosed regex part, followed by 0 or more other character than mathing third enclosed part, finaly followed by a second character same as third regex part... OR no coma, single nor double quote [,'"]...

Edit 3 Warn! this was Wrong!:

So right answer seem definitively be something like:

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;ta'

you could see my error in adding a ;L for debugging, before ta:

sed -e ':a;s/^\(\(\(['\''"]\)[^\3]*\3\|[^",'\'']*\)*\),/\1|/;L;ta'

where

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1#/;L;ta'
1#"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh",236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236,"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-",-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55
1#"John Doe"#"6, rue Peuh"#236#"B,-,F,H,P,-"#-55

we can see that this is not as simple... [^\3] dont give intended effect, but instead match for not char 3.

Finaly, we have to search for each delimitor for himself:

:a;
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta

Nota: from there, I will present csv2tsv as coma to tab separated values, if you really prefer to use | pipe as separator, you could replace \t by | or any char you want.

well commmand line is less sexy:

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;L;ta' 
1       "John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh",236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236,"B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-",-55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55
1   "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55

But this match the need.

echo '1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55' |
  sed -e ':a;s/^\(\("[^"]*"\|'\''[^'\'']*'\''\|[^",'\'']*\)*\),/\1\t/;ta' 
1       "John Doe"      "6, rue Peuh"   236     "B,-,F,H,P,-"   -55

In fine, creating sedscript:

cat >csv2tsv.sed <<eof
#!/bin/sed -f
# Coma separated values to Tab separated values

:a
s/^\(\("[^"]*"\|'[^']*'\|[^",']*\)*\),/\1\t/;
ta
eof

chmod +x csv2tsv.sed

Now:

cat >file.csv <<eof
A,B,"C,D",E,"F,G",H,"I,J,K"
"Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
1,"John Doe","6, rue Peuh",236,"B,-,F,H,P,-",-55
4,"hacker's string",'one quote: "I have no special talents. I am only passionat\
ely curious." - Albert Einstein',unquoted string,9,1,1,3
eof

./csv2tsv.sed file.csv 
A   B       "C,D"   E      "F,G"    H    "I,J,K"
"Chang, Yao-Jen"    33     MIS      "Taiwan, Taipei"        M
1   "John Doe"      "6, rue Peuh"   236  "B,-,F,H,P,-"      -55
4   "hacker's string"      'one quote: "I have no special talents. I am only pa
ssionately curious." - Albert Einstein' unquoted string 9  1    1       3

Solution 2

Here's one way using GNU awk and the FPAT variable:

awk 'BEGIN { FPAT="([^,]+)|(\"[^\"]+\")"; OFS="|" } $1=$1' file

Results:

A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

Solution 3

$ awk 'BEGIN{FS=OFS="\""} {for (i=1;i<=NF;i+=2) gsub(/,/,"|",$i)}1' file
A|B|"C,D"|E|"F,G"|H|"I,J,K"
"Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M

Solution 4

Unless this is a learning exercise about sed, I'd use a language with a proper CSV parser, for example:

ruby -rcsv -ne '
    puts CSV.generate_line(CSV.parse_line($_), {:col_sep => "|"})
' filename

outputs

A|B|C,D|E|F,G|H|I,J,K
Chang, Yao-Jen|33|MIS|Taiwan, Taipei|M

The quotes have vanished. That's because there are no "interior" separators that require quoting. If some pipes appear in the input, then you'll see some fields double quoted in the output.

Share:
10,388
loganaayahee
Author by

loganaayahee

Updated on July 28, 2022

Comments

  • loganaayahee
    loganaayahee over 1 year

    I want to replace every comma not enclosed in a pair of double quotes with '|' using pattern matching within .

    For example, given the following input:

    A,B,"C,D",E,"F,G",H,"I,J,K"
    "Chang, Yao-Jen",33,MIS,"Taiwan, Taipei",M
    

    This is the desired output:

    A|B|"C,D"|E|"F,G"|H|"I,J,K"
    "Chang, Yao-Jen"|33|MIS|"Taiwan, Taipei"|M
    

    I want this accomplished without using hard coding like:

    sed '2s/33,MIS/33|MIS|/' file.
    
  • F. Hauri  - Give Up GitHub
    F. Hauri - Give Up GitHub over 11 years
    @PrinceJohnWesley Thanks, but I could do better ;)
  • F. Hauri  - Give Up GitHub
    F. Hauri - Give Up GitHub over 11 years
    @sputnick there is a new version more efficient
  • Ed Morton
    Ed Morton over 11 years
    I see it stripped all the double quotes from the file. Is there an option to leave those alone?
  • LSerni
    LSerni over 11 years
    +1 impressive performance :-)
  • glenn jackman
    glenn jackman over 11 years
    @EdMorton, thanks for your comment. I updated my answer a bit to have the CSV module generate the output line. That way, if there are any separators in the input, they'll be properly quoted in the output.
  • James Brown
    James Brown over 8 years
    This solution looses empty columns: A,B,,C,,,D would produce A|B|C|D
  • Steve
    Steve over 8 years
    @JamesBrown: Please see the link in my answer. "A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty". Instead use: awk -v FPAT="([^,]*)|(\"[^\"]+\")" -v OFS="|" '$NF=$NF' file
  • F. Hauri  - Give Up GitHub
    F. Hauri - Give Up GitHub about 8 years
    But this won't work with quotes like "I don't believe in mathematics." - Albert Einstein :-(