How to handle commas within a CSV file being read by bash script
Solution 1
If you want to do it all in awk (GNU awk 4 is required for this script to work as intended):
awk '{
for (i = 0; ++i <= NF;) {
substr($i, 1, 1) == "\"" &&
$i = substr($i, 2, length($i) - 2)
printf "[%s]%s", $i, (i < NF ? OFS : RS)
}
}' FPAT='([^,]+)|("[^"]+")' infile
Sample output:
% cat infile
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
% awk '{
for (i = 0; ++i <= NF;) {
substr($i, 1, 1) == "\"" &&
$i = substr($i, 2, length($i) - 2)
printf "[%s]%s", $i, (i < NF ? OFS : RS)
}
}' FPAT='([^,]+)|("[^"]+")' infile
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
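The heavy lifting here is done by FPAT (introduced in gawk 4.0), which defines what a field *looks like* rather than what separates fields: a field is either a run of non-comma characters or a double-quoted string. A minimal sanity check of just the FPAT pattern (assumes gawk is available as gawk):

```shell
# A field is either a run of non-commas, or a quoted string (comma intact)
echo '"a, b",c' | gawk 'BEGIN { FPAT = "([^,]+)|(\"[^\"]+\")" } { print NF; print $1 }'
# prints: 2
#         "a, b"
```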
With Perl:
perl -MText::ParseWords -lne'
print join " ", map "[$_]",
parse_line(",",0, $_);
' infile
The following should work with older awk versions (it's based on a comp.unix.shell post, and it also removes the embedded commas):
awk '{
n = parse_csv($0, data)
for (i = 0; ++i <= n;) {
gsub(/,/, " ", data[i])
printf "[%s]%s", data[i], (i < n ? OFS : RS)
}
}
function parse_csv(str, array, field, i) {
split( "", array )
str = str ","
while ( match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/) ) {
field = substr(str, 1, RLENGTH)
gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
gsub(/""/, "\"", field)
array[++i] = field
str = substr(str, RLENGTH + 1)
}
return i
}' infile
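One thing the match() loop above handles that the FPAT pattern does not is doubled quotes ("" as an escaped quote inside a quoted field). A quick check, with the gsub(/,/, " ", ...) line dropped so the embedded comma survives (plain POSIX awk, no gawk required):

```shell
awk '{
    n = parse_csv($0, data)
    for (i = 0; ++i <= n;)
        printf "[%s]%s", data[i], (i < n ? OFS : RS)
}
function parse_csv(str, array,   field, i) {
    split("", array)
    str = str ","
    while (match(str, /[ \t]*("[^"]*(""[^"]*)*"|[^,]*)[ \t]*,/)) {
        field = substr(str, 1, RLENGTH)
        gsub(/^[ \t]*"?|"?[ \t]*,$/, "", field)
        gsub(/""/, "\"", field)            # unescape doubled quotes
        array[++i] = field
        str = substr(str, RLENGTH + 1)
    }
    return i
}' <<'EOF'
"He said ""hi"", twice",AU,1
EOF
# prints: [He said "hi", twice] [AU] [1]
```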
Solution 2
Building on @Dimitre's solution above, you can do something like this:
#!/usr/local/bin/gawk -f
BEGIN {
FS=","
FPAT="([^,]+)|(\"[^\"]+\")"
}
{
for (i=1;i<=NF;i++)
printf ("[%s] ",$i);
print ""
}
Test:
[jaypal:~/Temp] cat filename
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
[jaypal:~/Temp] ./script.awk filename
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya, Republic"] [RU-AD] [21250] [RU] [Russian Federation]
To remove the " characters, you can pipe the output to sed.
[jaypal:~/Temp] ./script.awk filename | sed 's#\"##g'
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya, Republic] [RU-AD] [21250] [RU] [Russian Federation]
Solution 3
After thinking about the problem, I realised that since the comma in the string isn't important to me, it'd be easier to simply remove it from the input before parsing.
To that end, I've concocted a sed
command that matches strings surrounded by doubled quotes that contain a comma. The command then removes the bits you don't want from the matched string. It does this by separating the regex into remembered sections.
This solution only works where the string contains a single comma between double quotes.
The unescaped regex is
(")(.*)(,)(.*)(")
The first, third, and fifth pairs of parentheses capture the opening double quote, comma, and closing double quote respectively.
The second and fourth pairs of parentheses capture the actual content of the field, which we want to keep.
sed command to remove the comma (the replacement keeps groups 1, 2, 4, and 5, dropping only the comma):
echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\1\2\4\5/'
sed command to remove the comma and the double quotes (keeping only groups 2 and 4):
echo "$input" | sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'
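It's worth sanity-checking the substitution on the sample row before pointing it at the real file; keeping groups 2 and 4 drops the quotes and the embedded comma in one pass:

```shell
echo '"Adygeya, Republic",RU-AD,21250,RU,Russian Federation' |
    sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/'
# prints: Adygeya Republic,RU-AD,21250,RU,Russian Federation
```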
Updated Code:
tmpFile="${input}Temp"
sed 's/\(\"\)\(.*\)\(,\)\(.*\)\(\"\)/\2\4/' < "$input" > "$tmpFile"
while IFS=, read province provinceCode criteriaId countryCode country
do
echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < "$tmpFile"
rm "$tmpFile"
Output:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
[Bío-Bío] [CL-BI] [20154] [CL] [Chile]
Solution 4
If you can tolerate having the surrounding quotes persist in the output, you can use a small script I wrote called csvquote to enable awk and cut (and other UNIX text tools) to properly handle quoted fields that contain commas. You wrap the command like this:
csvquote inputfile.csv | awk -F, '{print "["$1"] ["$2"] ["$3"] ["$4"] ["$5"]"}' | csvquote -u
See https://github.com/dbro/csvquote for the code and documentation.
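The idea behind csvquote is to translate commas (and newlines) that appear *inside* quoted fields into nonprinting characters before the pipeline runs, and translate them back afterwards. A rough, hypothetical sketch of that trick (the enc/dec names are mine, not part of csvquote; it assumes the data contains no literal 0x1F bytes and relies on gawk's empty-FS per-character splitting):

```shell
# Encode: while inside a quoted field, turn commas into the ASCII
# unit separator (0x1F); decoding is just translating them back.
enc() {
    gawk 'BEGIN { FS = "" } {
        inq = 0; out = ""
        for (i = 1; i <= NF; i++) {
            c = $i
            if (c == "\"") inq = !inq            # toggle quoted state
            else if (c == "," && inq) c = "\037"
            out = out c
        }
        print out
    }'
}
dec() { tr '\037' ','; }

enc <<'EOF' | awk -F, '{ print "[" $1 "] [" $2 "]" }' | dec
"Adygeya, Republic",RU-AD
EOF
# prints: ["Adygeya, Republic"] [RU-AD]
```

Note the surrounding quotes persist in the output, matching the caveat above.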
Updated on June 04, 2022
Comments
-
chrisbunney almost 2 years
I'm creating a bash script to generate some output from a CSV file (I have over 1000 entries and don't fancy doing it by hand...).
The content of the CSV file looks similar to this:
Australian Capital Territory,AU-ACT,20034,AU,Australia
Piaui,BR-PI,20100,BR,Brazil
"Adygeya, Republic",RU-AD,21250,RU,Russian Federation
I have some code that can separate the fields using the comma as delimiter, but some values actually contain commas, such as "Adygeya, Republic". These values are surrounded by quotes to indicate that the characters within should be treated as part of the field, but I don't know how to parse them to take this into account. Currently I have this loop:
while IFS=, read province provinceCode criteriaId countryCode country
do
echo "[$province] [$provinceCode] [$criteriaId] [$countryCode] [$country]"
done < $input
which produces this output for the sample data given above:
[Australian Capital Territory] [AU-ACT] [20034] [AU] [Australia]
[Piaui] [BR-PI] [20100] [BR] [Brazil]
["Adygeya] [ Republic"] [RU-AD] [21250] [RU,Russian Federation]
As you can see, the third entry is parsed incorrectly. I want it to output
[Adygeya Republic] [RU-AD] [21250] [RU] [Russian Federation]
-
chrisbunney over 12 years: Thanks, not sure why this is community wiki, but I will check this out :)
-
jaypal singh over 12 years: @chrisbunney Since I took Dimitre's solution as a reference, I thought it would be inappropriate to take credit for this answer. :)
-
chrisbunney about 12 years: Just tested this; it doesn't produce the same output for me that it does for you. In fact, it produces the same "bad" output that I described in my question.
-
chrisbunney about 12 years: Thanks, it seems my installation of Debian 6 isn't using awk 4; I assumed the package would have a more recent version of awk.
-
Dimitre Radoulov about 12 years: You could try the Perl solution I've just added.
-
chrisbunney about 12 years: Accepted and +1, as I think this is the best solution, even if it's not the one I can use in this case.
-
Dimitre Radoulov about 12 years: Hi @chrisbunney, I've added a version that should work with your awk.
-
jaypal singh about 12 years: @chrisbunney Looks like an awk version issue. I tested it on gnu-awk v4.0.0.
-
chrisbunney about 12 years: Yes, with a bit of help from @Dimitre, it turns out there's an older version of awk on my machine.