Remove quotes around integers in a csv file
Solution 1
Using perl:
perl -ne 's/"(\d+)"/$1/g; print' file.csv > new_file.txt
All the work is done by s/"(\d+)"/$1/g
where
- s/patternA/patternB/ is used to replace patternA by patternB
- perl looks for one or more digits, \d+, surrounded by double quotes
- the parentheses around \d+ capture the digit(s) so they can be reused in the replacement through perl's special variable $1
- the trailing g applies the substitution to every match on the line, not just the first one.
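For comparison, the same substitution can be sketched in Python with re.sub. This is only an illustration, not part of the original answer: the function name unquote_integers is hypothetical, and [0-9] is used instead of \d so that only ASCII digits are matched.

```python
import re

# One or more ASCII digits between double quotes; the group captures the digits.
QUOTED_INT = re.compile(r'"([0-9]+)"')

def unquote_integers(line):
    """Return the line with the quotes removed around all-digit fields."""
    # \1 plays the role of perl's $1: it re-inserts the captured digits.
    return QUOTED_INT.sub(r'\1', line)

if __name__ == "__main__":
    sample = '"34432", "name", "0", "very long description"'
    print(unquote_integers(sample))
    # prints: 34432, "name", 0, "very long description"
```

re.sub replaces every match by default, so no equivalent of the g flag is needed.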
Solution 2
A GNU sed regex that should work for this case is
sed -r 's/"([0-9]+)"/\1/g'
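Without -r (basic regular expressions) you need to escape the grouping parentheses and the + modifier. Note that \+ is itself a GNU extension; a strictly POSIX sed would need \{1,\} instead.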
sed 's/"\([0-9]\+\)"/\1/g'
You can perform the substitution in-place with some versions of sed, e.g.
sed -ri 's/"([0-9]+)"/\1/g' file.csv
You could also use the POSIX class [[:digit:]] in place of the character range [0-9].
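The choice of digit class matters when the same pattern is carried over to another regex engine. As a hedged illustration (Python 3 behaviour for str patterns, not something the answer itself covers), \d also accepts non-ASCII decimal digits, while [0-9] does not. The helper name matches is hypothetical.

```python
import re

ascii_only = re.compile(r'"([0-9]+)"')   # ASCII digits only
unicode_digits = re.compile(r'"(\d+)"')  # any Unicode decimal digit in Python 3

def matches(pattern, text):
    """True if the whole text is a quoted run of digits for this pattern."""
    return pattern.fullmatch(text) is not None

if __name__ == "__main__":
    # The second sample is "42" written with Arabic-Indic digits.
    for sample in ['"42"', '"\u0664\u0662"', '"0.5"']:
        print(sample, matches(ascii_only, sample), matches(unicode_digits, sample))
```

Neither pattern matches "0.5", because the dot is not a digit in either class.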
Solution 3
Your description of the problem is not very specific. I am assuming you want to remove the double quotes around the 1st and 3rd fields only. If so, any of these should work:
sed
sed -r 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/\1\2, \3/' file.csv
Explanation
The -r enables extended regular expressions, allowing us to use parentheses to capture patterns without needing to escape them. So, we match a quote at the beginning of the line (^"), followed by one or more non-quote characters ([^"]+), then the closing quote. The second group captures 0 or more spaces, a comma, 0 or more spaces again (\s*,\s*), and then a stretch of non-commas up to the next comma ([^,]+); this defines the 2nd field. Finally, we look for 0 or more spaces, a comma, 0 or more spaces and the quoted 3rd field ("([^"]+)"), and replace the whole match with the 1st captured pattern (\1), then the 2nd (\2), a comma, a space and the 3rd (\3).
Perl
perl -pe 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/$1$2, $3/; ' file.csv
Explanation
The -p means print every line after applying the script passed by -e. The script itself is basically the same regex as in the sed above. The only difference is that the captured patterns are referenced as $1, $2 and $3 instead of \1, \2 and \3.
awk
awk -F, -v OFS="," '{gsub("\"","",$1); gsub("\"","",$3);}1;' file.csv
Explanation
The -F sets the field separator to a comma. OFS is the output field separator, which is also set to a comma so that the lines are printed correctly. The gsub makes the substitution, replacing all " with nothing; since we run it on the 1st ($1) and 3rd ($3) fields, it will only remove the quotes from those fields. The 1; is just awk shorthand for "print the line".
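The field-based idea behind the awk version can be sketched in Python as well. This is an illustration under the same assumption as the awk command (a plain comma split, so a comma inside a quoted field would break it); the function name unquote_fields and its positions parameter are hypothetical.

```python
def unquote_fields(line, positions=(0, 2)):
    """Remove double quotes from the fields at the given 0-based positions."""
    fields = line.rstrip("\n").split(",")
    for i in positions:
        # Skip positions that do not exist on short lines.
        if i < len(fields):
            fields[i] = fields[i].replace('"', "")
    return ",".join(fields)

if __name__ == "__main__":
    print(unquote_fields('"34432", "name", "0", "very long description"'))
    # prints: 34432, "name", 0, "very long description"
```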
Solution 4
Python solution
The small script below takes a file as a command-line argument, iterates over each line in that file, and splits each line into a list of items using , as the separator. Each entry is then unquoted and checked for being a numeric string; if it is numeric, it is left unquoted, otherwise the quotes are put back.
#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            # drop surrounding whitespace and quotes before testing
            val = val.strip().replace('"', '')
            # put the quotes back on anything that is not purely digits
            if not val.isdigit():
                val = '"' + val + '"'
            new_vals.append(val)
        print(",".join(new_vals))
Test run:
$ cat input.txt
"34432", "name", "0", "very long description"
"1234", "othe name" , "42", "another description"
$ ./unquote_integers.py input.txt
34432,"name",0,"very long description"
1234,"othe name",42,"another description"
Additional notes:
It was asked in the comments why the script removes the double quotes around each item before evaluating whether the item is a numeric string. The main reason is that including the double quotes makes an item like "123" evaluate to False, i.e. non-numeric. Effectively, we need to evaluate what's within the double quotes somehow. There is an alternative way to approach this by taking a slice of each value. However, that's not any better than using .replace() from the beginning. It does make the code shorter, but at least in this case the shortness of the script is irrelevant: our goal is to make the code work, not to code-golf it.
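The behaviour described above can be checked directly. The snippet below is a small sketch; the variable names are only for illustration.

```python
quoted = '"123"'

with_quotes = quoted.isdigit()                   # quotes are not digits
sliced = quoted[1:-1].isdigit()                  # test only what is between the quotes
replaced = quoted.replace('"', '').isdigit()     # same result after removing the quotes
negative = '-5'.isdigit()                        # a minus sign is not a digit either

print(with_quotes, sliced, replaced, negative)
# prints: False True True False
```

One consequence of relying on isdigit() is that negative numbers stay quoted.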
Here's the alternative solution with list slices:
#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            val = val.strip()  # remove extra spaces
            # strip the quotes only when the text between them is all digits
            val = val.replace('"', '') if val[1:-1].isdigit() else val
            new_vals.append(val)
        print(",".join(new_vals))
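Both scripts above split on every comma, so a comma inside a quoted field would be treated as a separator. A possible variation, sketched here with the standard csv module for parsing, avoids that. It is not part of the original answer; the names format_field and unquote_integers_csv are hypothetical, and str.isascii() assumes Python 3.7 or later.

```python
import csv
import io

def format_field(value):
    """Return all-digit values bare and everything else double-quoted."""
    value = value.strip()
    # isascii() keeps non-ASCII digit characters quoted.
    if value.isascii() and value.isdigit():
        return value
    # Double any embedded quotes, as CSV quoting requires.
    return '"' + value.replace('"', '""') + '"'

def unquote_integers_csv(src, dst):
    """Copy CSV rows from src to dst, unquoting integer fields."""
    # skipinitialspace lets the reader accept '"a", "b"' style input.
    for row in csv.reader(src, skipinitialspace=True):
        dst.write(",".join(format_field(v) for v in row) + "\n")

if __name__ == "__main__":
    src = io.StringIO('"34432", "name", "0", "very long, description"\n')
    out = io.StringIO()
    unquote_integers_csv(src, out)
    print(out.getvalue(), end="")
    # prints: 34432,"name",0,"very long, description"
```

Because the digits are written back as text rather than converted to int, leading zeros such as 007 are preserved.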
Sandeep
Updated on September 18, 2022
Comments
-
Sandeep over 1 year
In a large (>1 GB) csv file I have something like
"34432", "name", "0", "very long description"
but instead of that I'd like to have
34432, "name", 0, "very long description"
I was looking at sed, but this task is out of my scope. Any advice on how to achieve this?
-
Sam about 10 years
Maybe unnecessary. Just load that csv into a database table where all columns are strings, then change the definition of the 1st and 3rd columns to integer after all the data is there. It should retain the values.
-
Sandeep about 10 years
Interesting suggestion @Sam!
-
Sandeep about 10 years
I wanted to trim the integers.
-
Sandeep about 10 years
Ok, thanks for your effort. At least I know how to clean the columns in the specific positions, but actually I needed that regexp/capture thing on the integers.
-
terdon about 10 years
@BalázsMáriaNémeth in that case use the accepted answer, but careful: \d does not only match integers, it will also match numbers like 0.5 and, depending on your locale settings, more obscure stuff as well.
-
Sandeep about 7 years
I don't understand the part where you replace the quotes and, for non-digits, put them back. Why not just do the replace for digits and leave the rest alone? So val = val.strip().rstrip().replace('"','') if val.isdigit() else val. But I don't know if this is a valid expression in python.
-
Sergiy Kolodyazhnyy about 7 years
@BalázsMáriaNémeth The problem is the double quotes themselves. Because the string includes quotes, it will be considered non-numeric. To determine whether a string consists of digits, I would need to get rid of the quotes somehow. You can see that with a simple test: python -c 'print("\"123\"".isdigit())'. This will print False. Now, there is indeed a way to shorten this by using a list slice. In python, a string can be treated as a list of characters, and your expression can be used by evaluating val[1:-1]. However, shortness here is irrelevant. We want the code to work, and that's it.
-
Sergiy Kolodyazhnyy about 7 years
@BalázsMáriaNémeth and yes, that's a valid python expression
-
Sandeep about 7 years
thanks @Serg, you're absolutely right, it only seemed strange at first look :)