How to handle commas within a CSV file being read by bash script


Solution 1

If you want to do it all in awk (GNU awk 4 or later is required, since the script relies on the FPAT feature):

awk '{ 
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" && 
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }   
 }' FPAT='([^,]+)|("[^"]+")' infile

Sample output:

% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{    
 for (i = 0; ++i <= NF;) {
   substr($i, 1, 1) == "\"" &&
     $i = substr($i, 2, length($i) - 2)
   printf "[%s]%s", $i, (i < NF ? OFS : RS)
    }
 }' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]

With Perl:

perl -MText::ParseWords -lne'
 print join " ", map "[$_]", 
   parse_line(",",0, $_);
  ' infile 

This version should work with your awk release (based on a comp.unix.shell post); it also replaces the embedded commas with spaces.

awk '{
 n = parse_csv($0, data)
 for (i = 0; ++i <= n;) {
    gsub(/,/, " ", data[i])
    printf "[%s]%s", data[i], (i < n ? OFS : RS)
    }
  }
function parse_csv(str, array,   field, i) { 
  split( "", array )
  str = str ","
  while ( match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/) ) { 
    field = substr(str, 1, RLENGTH)
    gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
    gsub(/""/, "\"", field)
    array[++i] = field
    str = substr(str, RLENGTH + 1)
  }
  return i
}' infile

Solution 2

After looking at @Dimitre's solution here, you can do something like this:

#!/usr/local/bin/gawk -f

BEGIN {
    FS = ","
    FPAT = "([^,]+)|(\"[^\"]+\")"
}

{
    for (i = 1; i <= NF; i++)
        printf "[%s] ", $i
    print ""
}

Test:

[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation

[jaypal:~/Temp] ./script.awk  filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation] 

To remove the double quotes, you can pipe the output to sed.

[jaypal:~/Temp] ./script.awk  filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia] 
[Piaui] [BR-PI] [20100] [BR] [Brazil] 
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation] 

Solution 3

After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.

To that end, I've concocted a sed command that matches strings surrounded by double quotes that contain a comma. The command then removes the bits you don't want from the matched string. It does this by separating the regex into remembered sections.

This solution only works where the string contains a single comma between double quotes.
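If a quoted field may contain more than one embedded comma, a looping substitution can strip them all before the quotes come off. This is a sketch using sed's label and branch commands (`:a` / `ta`); like the original, it assumes at most one quoted field per line:

```shell
# Each pass deletes one comma between the double quotes; the `ta` branch
# repeats until no substitution happens. [^",]* before the comma keeps
# every pass anchored inside the quoted field.
echo '"Adygeya, Republic, Federal",RU-AD' |
  sed -e :a -e 's/\("[^",]*\),\([^"]*"\)/\1\2/' -e ta
```

which prints `"Adygeya Republic Federal",RU-AD`, leaving the field-separator commas outside the quotes intact.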

The unescaped regex is

(")(.*)(,)(.*)(")

The first, third, and fifth pairs of parentheses capture the opening double quote, the comma, and the closing double quote respectively.

The second and fourth pairs of parentheses capture the actual content of the field, which we want to keep.

sed Command To Remove Comma:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\4\5/'

sed Command To Remove Comma and Double Quotes:

echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'

Updated Code:

tmpFile="${input}Temp"
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < "$input" > "$tmpFile"
while IFS=, read -r province provinceCode criteriaId countryCode country
do
    echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$tmpFile"
rm "$tmpFile"

Output:

[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]

Solution 4

If you can tolerate the surrounding quotes persisting in the output, you can use a small script I wrote called csvquote, which enables awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:

csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u

See https://github.com/dbro/csvquote for the code and documentation.
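The idea behind csvquote can be sketched in a few lines of awk (this is an illustration of the approach, not the actual csvquote implementation): commas inside quoted fields are swapped for a non-printing character (here `\037`, the ASCII unit separator) so ordinary comma-delimited tools work unmodified, then swapped back afterwards:

```shell
# encode: hide commas that appear inside double-quoted fields
encode() {
  awk '{
    out = ""; inq = 0
    for (j = 1; j <= length($0); j++) {
      c = substr($0, j, 1)
      if (c == "\"") inq = !inq              # toggle in-quotes state
      else if (c == "," && inq) c = "\037"   # hide embedded comma
      out = out c
    }
    print out
  }'
}

# decode: restore the hidden commas
decode() { tr '\037' ','; }

echo '"Adygeya, Republic",RU-AD,21250' |
  encode | awk -F, '{print "["$1"] ["$2"] ["$3"]"}' | decode
# ["Adygeya, Republic"] [RU-AD] [21250]
```

As with csvquote itself, the surrounding quotes persist in the output; strip them with a final `sed 's/"//g'` if needed.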

Author: chrisbunney

Updated on June 04, 2022

Comments

  • chrisbunney
    chrisbunney almost 2 years

    I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).

    The content of the CSV file looks similar to this:

    Australian Capital Territory,AU-ACT,20034,AU,Australia
    Piaui,BR-PI,20100,BR,Brazil
    "Adygeya, Republic",RU-AD,21250,RU,Russian Federation
    

    I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as Adygeya, Republic. These values are surrounded by quotes to indicate the characters within should be treated as part of the field, but I don't know how to parse it to take this into account.

    Currently I have this loop:

    while IFS=, read province provinceCode criteriaId countryCode country
    do
        echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
    done < $input
    

    which produces this output for the sample data given above:

    [Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
    [Piaui] [BR-PI] [20100] [BR] [Brazil]
    ["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
    

    As you can see, the third entry is parsed incorrectly. I want it to output

    [Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
    
  • chrisbunney
    chrisbunney over 12 years
    Thanks, not sure why this is community wiki, but will check this out :)
  • jaypal singh
    jaypal singh over 12 years
@chrisbunney Since I took Dimitre's solution as a reference, I thought it would be inappropriate to take credit for this answer. :)
  • chrisbunney
    chrisbunney about 12 years
Just tested this; it doesn't produce the same output for me that it does for you. In fact, it produces the same "bad" output that I described in my question
  • chrisbunney
    chrisbunney about 12 years
Thanks, it seems my installation of Debian 6 isn't using awk 4; I assumed the package would have a more recent version of awk
  • Dimitre Radoulov
    Dimitre Radoulov about 12 years
    You could try the Perl solution I've just added.
  • chrisbunney
    chrisbunney about 12 years
    Accepted and +1 as I think this is best solution, even if it's not the one that I can use in this case
  • Dimitre Radoulov
    Dimitre Radoulov about 12 years
Hi @chrisbunney, I've added a variant that should work with your awk version.
  • jaypal singh
    jaypal singh about 12 years
@chrisbunney Looks like an awk version issue. I tested it on GNU awk 4.0.0
  • chrisbunney
    chrisbunney about 12 years
    Yes, with a bit of help from @Dimitre, it turns out there's an older version of awk on my machine