Remove quotes around integers in a CSV file

Solution 1

Using perl:

perl -ne 's/"(\d+)"/$1/g; print' file.csv > new_file.txt

All the work is done by s/"(\d+)"/$1/g where

  • s/patternA/patternB/ replaces patternA with patternB,
  • perl then looks for one or more digits (\d+) surrounded by double quotes,
  • the parentheses around \d+ capture the digit(s) so they can be reused in the replacement through perl's special variable $1,
  • and the trailing /g applies the substitution to every match on the line, not just the first (a sample run follows this list).
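
For example, piping the sample line from the question through the one-liner should give:

$ printf '"34432", "name", "0", "very long description"\n' | perl -ne 's/"(\d+)"/$1/g; print'
34432, "name", 0, "very long description"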

Solution 2

A GNU sed regex that should work for this case is

sed -r 's/"([0-9]+)"/\1/g'    

Without -r you need to escape the grouping parentheses and the + quantifier (note that \+ is itself a GNU extension; a strictly POSIX sed would use [0-9][0-9]* or \{1,\} instead)

sed 's/"\([0-9]\+\)"/\1/g'

You can perform the substitution in-place with some versions of sed, e.g.

sed -ri 's/"([0-9]+)"/\1/g' file.csv

You could also use the POSIX character class [[:digit:]] in place of the range [0-9].
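
As a quick sanity check (assuming GNU sed for the -r form), the sample line from the question becomes:

$ printf '"34432", "name", "0", "very long description"\n' | sed -r 's/"([0-9]+)"/\1/g'
34432, "name", 0, "very long description"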

Solution 3

Your description of the problem is not very specific. I am assuming you want to remove the double quotes around the 1st and 3rd fields only. If so, any of these should work:

  1. sed

    sed -r 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/\1\2, \3/' file.csv
    

    Explanation

    The -r enables extended regular expressions, allowing us to use parentheses to capture patterns without needing to escape them. We match a quote at the beginning of the line (^"), followed by one or more non-quote characters ([^"]+) and the closing quote; that quoted content is the 1st captured group. The 2nd group (\s*,\s*[^,]+) takes the following comma, with any surrounding spaces, plus the whole 2nd field, i.e. a stretch of non-commas up to the next comma. Finally, we match 0 or more spaces, a comma, 0 or more spaces and the quoted 3rd field, and replace the whole match with the 1st captured group (\1), the 2nd (\2), a comma, a space and the 3rd (\3).

  2. Perl

    perl -pe 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/$1$2, $3/; ' file.csv
    

    Explanation

    The -p means print every line after applying the script passed with -e. The script itself is essentially the same regex as in the sed command above; the only difference is that the captured groups are referenced as $1, $2 and $3 instead of \1, \2 and \3.

  3. awk

    awk -F, -v OFS="," '{gsub("\"","",$1); gsub("\"","",$3)}1' file.csv
    

    Explanation

    The -F option sets the input field separator to , and OFS is the output field separator, also set to , so that the modified lines are printed back correctly. Each gsub does the substitution, replacing every " with nothing; since we run it only on the 1st ($1) and 3rd ($3) fields, only those fields have their quotes removed. The trailing 1 is just awk shorthand for "print the line". A sample run of all three commands on the line from the question follows this list.
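
All three commands should produce the same result (assuming GNU sed and awk):

$ line='"34432", "name", "0", "very long description"'
$ printf '%s\n' "$line" | sed -r 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/\1\2, \3/'
34432, "name", 0, "very long description"
$ printf '%s\n' "$line" | perl -pe 's/^"([^"]+)"(\s*,\s*[^,]+)\s*,\s*"([^"]+)"/$1$2, $3/;'
34432, "name", 0, "very long description"
$ printf '%s\n' "$line" | awk -F, -v OFS="," '{gsub("\"","",$1); gsub("\"","",$3)}1'
34432, "name", 0, "very long description"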

Solution 4

Python solution

The small script below takes a file name as a command-line argument, iterates over each line in that file, and splits each line into a list of items using , as the separator. Each item is then unquoted and checked for being a numeric string; if it is numeric, it is left unquoted, otherwise the double quotes are put back.

#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            # strip surrounding whitespace and drop any double quotes
            val = val.strip().replace('"', '')
            if not val.isdigit():
                # non-numeric values get their quotes back
                val = '"' + val + '"'
            new_vals.append(val)
        print(",".join(new_vals))

Test run:

$ cat input.txt
"34432", "name", "0", "very long description" 
"1234", "othe name" , "42", "another description"
$ ./unquote_integers.py  input.txt                                       
34432,"name",0,"very long description"
1234,"othe name",42,"another description"

Additional notes:

It was asked in the comments why the script removes the double quotes around each item before testing whether the item is a numeric string. The main reason is that the double quotes themselves make a value like "123" evaluate as non-numeric, i.e. isdigit() returns False, so we need to test what is inside the quotes somehow. There is an alternative approach that takes a list slice of each value, but it is not really better than using .replace() from the start. It makes the code shorter, but in this case the shortness of the script is irrelevant; the goal is to make the code work, not to code-golf it.
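
A quick check on the command line illustrates the point:

$ python -c 'print("\"123\"".isdigit())'
False
$ python -c 'print("123".isdigit())'
True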

Here's the alternative solution with list slices:

#!/usr/bin/env python
import sys

with open(sys.argv[1]) as fp:
    for line in fp:
        new_vals = []
        vals = line.strip().split(',')
        for val in vals:
            val = val.strip()  # remove extra spaces
            # slice off the surrounding quotes to test the content;
            # drop the quotes only if what is inside them is numeric
            val = val.replace('"', '') if val[1:-1].isdigit() else val
            new_vals.append(val)
        print(",".join(new_vals))

Comments

  • Sandeep
    Sandeep over 1 year

    In a large (>1 gb) csv file I have something like

    "34432", "name", "0", "very long description"
    

    but instead of that I'd like to have

    34432, "name", 0, "very long description".
    

    I was looking at sed but this task is out of my scope.

    Any advice how to achieve this?

    • Sam
      Sam about 10 years
      Maybe unnecessary. Just load that csv into a database table where all columns are strings, then change the definition of the 1st and 3rd columns to integer after all the data is there. It should retain the values.
    • Sandeep
      Sandeep about 10 years
      Interesting suggestion @Sam!
  • Sandeep
    Sandeep about 10 years
    I wanted to trim the integers.
  • Sandeep
    Sandeep about 10 years
    Ok, thanks for your effort. At least I know how to clean the columns in the specific positions but actually I needed that regexp/capture thing on the integers.
  • terdon
    terdon about 10 years
    @BalázsMáriaNémeth in that case use the accepted answer but careful, \d does not only match integers, it will also match numbers like 0.5 and depending on your locale settings more obscure stuff as well.
  • Sandeep
    Sandeep about 7 years
    I don't understand the part where you replace the quotes and for non digits, you put them back. Why not just do the replace for digits and leave alone the rest? So val = val.strip().rstrip().replace('"','') if val.isdigit() else val. But I don't know if this is a valid expression in python.
  • Sergiy Kolodyazhnyy
    Sergiy Kolodyazhnyy about 7 years
    @BalázsMáriaNémeth Problem is the double quotes themselves. Because string includes quotes, it will be considered non-numeric. To determine if a string is digit, I would need to get rid of quotes somehow. And you can see that with simple test: python -c 'print("\"123\"".isdigit())' . This will print False. Now, there is indeed a way to shorten this, with using list slice. In python, string can be considered a list of characters, and your expression can be used with evaluating val[1:-1]. However, shortness here is irrelevant. We want the code to work, and that's it
  • Sergiy Kolodyazhnyy
    Sergiy Kolodyazhnyy about 7 years
    @BalázsMáriaNémeth and yes, that's a valid python expression
  • Sandeep
    Sandeep about 7 years
    thanks @Serg, you're absolutely right, it only seemed strange at first look :)