How to remove symbols from a column using awk

text-processing sed awk

Solution 1

Using gsub:

awk '{gsub(/\"|\;/,"")}1' file
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

If you want to operate only on the fifth field and preserve any quotes or semicolons in other fields:

awk '{gsub(/\"|\;/,"",$5)}1' file
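
One thing to keep in mind with the field-specific form: when awk modifies $5 it rebuilds the whole record using the output field separator (OFS), which is a single space by default, so the original column spacing is not preserved. The following is a minimal sketch, assuming tab-separated output is acceptable and using the bracket expression suggested in the comments. The sample line is a hypothetical input inferred from the output shown above.

```shell
# Hypothetical sample line: the fifth column is a quoted ID followed by a semicolon.
line='chr1    134901  139379  -   "ENSG00000237683.5";'

# Changing $5 makes awk rebuild $0 with OFS, so set OFS to a tab explicitly.
result=$(printf '%s\n' "$line" | awk -v OFS='\t' '{gsub(/[";]/, "", $5)} 1')

printf '%s\n' "$result"
```

If the original spacing must be kept exactly, a substitution on the whole line (as in the sed answer further down) avoids the rebuild.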

Solution 2

If your data is formatted exactly as shown (i.e. no other " or ; in other columns that need to be preserved), then you can simply use tr to remove these characters:

tr -d '";' < input.txt > output.txt
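
To illustrate the caveat, here is a small sketch with a made-up line (not from the original data) in which the first column also contains a semicolon. tr has no notion of columns, so that semicolon is deleted as well.

```shell
# Hypothetical line where another column also contains a semicolon.
line='chr1;alt    134901  139379  -   "ENSG00000237683.5";'

# tr deletes every " and ; on the line, not only those in the fifth column.
stripped=$(printf '%s\n' "$line" | tr -d '";')

printf '%s\n' "$stripped"
```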

Solution 3

Using sed to remove all instances of " and ;:

sed -i 's/[";]//g' file

To remove them from only the fifth column, sed is probably not the best option.

Solution 4

I know the original post asked for sed or awk, but if you want to remove the " and ; from only the fifth column, I'd use a regex and PHP. There's probably a way to do this in awk, but I like to use the easiest tools.

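<?php

// Print each line with the quotes and semicolon stripped from the fifth column.
foreach (file($argv[1]) as $line) {
    // Capture the four leading columns, then the quoted ID without its quotes.
    if (preg_match('/^(\w+)\s+(\d+)\s+(\d+)\s+([-+])\s+"(\w+\.\d+)";/', $line, $matches)) {
        array_shift($matches); // drop the full match, keeping the five groups
        vprintf("%s\t%s\t%s\t%s\t%s\n", $matches);
    } else {
        echo $line; // pass lines that do not match through unchanged
    }
}

This would output: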

$ php /tmp/preg_replace.php /tmp/data
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

Solution 5

A sed solution that makes sure we're only fiddling around with the fifth column:

sed -E 's/^(([^ ]+ +){4})"([^"]+)";$/\1\3/' infile
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

This also works without ERE (-E, or -r for some older sed), but it requires a lot more backslashes. The + quantifier is ERE-only according to the POSIX spec¹ and can be replaced by {1,} (or \{1,\} for BRE).
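
For comparison, a BRE version of the same substitution might look like the sketch below. It is only a rewrite of the command above with escaped groups and \{1,\} in place of +. The sample line is a hypothetical input inferred from the output shown.

```shell
# Hypothetical sample line in the presumed input format.
line='chr1    134901  139379  -   "ENSG00000237683.5";'

# Same regex as the -E version, written as a POSIX basic regular expression.
bre_result=$(printf '%s\n' "$line" |
    sed 's/^\(\([^ ]\{1,\} \{1,\}\)\{4\}\)"\([^"]\{1,\}\)";$/\1\3/')

printf '%s\n' "$bre_result"
```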

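In case the columns aren't space-separated, the spaces can be replaced by the [:blank:] POSIX character class to also match tabs. A character class is only valid inside a bracket expression, so the space becomes [[:blank:]] and [^ ] becomes [^[:blank:]].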
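
A sketch of that variant, assuming a tab-separated input line (the sample is made up to mirror the data above):

```shell
# Hypothetical tab-separated sample line.
line=$(printf 'chr1\t134901\t139379\t-\t"ENSG00000237683.5";')

# [[:blank:]] matches spaces and tabs; [^[:blank:]] matches everything else.
blank_result=$(printf '%s\n' "$line" |
    sed -E 's/^(([^[:blank:]]+[[:blank:]]+){4})"([^"]+)";$/\1\3/')

printf '%s\n' "$blank_result"
```

Because the separators are captured in group 1, the original tabs are kept as they are.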

The regex in detail:

^               # Anchored at start of line
(               # Capture group 1 for first 4 columns
    (           # Capture group 2 for repeat count
        [^ ]+   # 1 or more non-spaces
         +      # 1 or more spaces
    ){4}        # 4 times "word plus spaces" (columns)
)               # End capture group 1
"               # Column 5 starts with double quote (not captured)
(               # Capture group 3 for column 5
    [^"]+       # One or more non-quote characters
)               # End capture group 3
";              # Quote and semicolon at end of column 5
$               # Anchored at end of line

¹ GNU sed, as an extension, allows \+ to be used in BRE as well.

Damian Romard

Updated on September 18, 2022

Comments

Damian Romard over 1 year
I have a playlist text file. I'm trying to extract a list of the artists and their songs. There are 39 line items and they appear as:

Rush - Red Sector A
Blues Traveler - Hook

This is a unicode file.

I'm trying to use the '-' as the delimiter and split the lines there:
```
x = open(u'list.txt')

for line in x:

    line = line.strip()

    elements = line.split('-')
    artist = elements[0]
    song = elements[1]
```
I get a traceback:
```
Traceback (most recent call last):
  File "playlist.py", line 34, in <module>
    song = line[1]
IndexError: list index out of range
```
It appears the delimiter is not being recognized. If I comment out "song = elements[1]" and print artists, I get the full line of text, delimiter and all. I've seen similar questions, but I can't get enough insight from their solutions to make this work. Any help would be appreciated.
- jonrsharpe over 9 years
  
  Are you sure you have the right dash? Try to cut and paste the precise symbol from the file you're reading.
- Damian Romard over 9 years
  
  I think it's not seeing a dash, but some representation of the dash in unicode: \xe2
- El Bert over 9 years
  
  Using your current example it works "Rush - Red Sector A".split("-") gives me ['Rush ', ' Red Sector A'] but with the string you had before you edit your question it wasn't working "Jace Everett – Bad Things Yes – Owner Of A Lonely Heart".split("-") gives me ['Jace Everett \xe2\x80\x93 Bad Things Yes \xe2\x80\x93 Owner Of A Lonely Heart']. Follow @jonrsharpe idea of using the symbol from the file directly
- Damian Romard over 9 years
  
  That's what I see too. If I copy and paste the dash as @jonrsharpe suggested, I get: File "playlist.py", line 30 SyntaxError: Non-ASCII character '\xe2' in file playlist.py on line 30, but no encoding declared
- Admin over 8 years
  
  @DigitalTrauma ya, but Dani_l already gave that solution.
Aphid over 9 years

This has also been discussed here, and is covered in PEP 0263
Damian Romard over 9 years

Just a note for noobs like me, the encoding notation needs to be all the way at the top of the script. Location, location, location :)
Dani_l over 8 years

This would remove from all columns, not just 5th, no?
System over 8 years

This is what I thought initially, but after using the code it seemed to keep all columns.
jasonwryan over 8 years

@Dani_l Yes, it can be refined to operate only on the fifth field, but that was not a requirement...
System over 8 years

Sorry I must have not made it clear, I DO want to keep all columns. This is why it is marked as the answer.
jasonwryan over 8 years

@System updated to ensure it only operates on the fifth field.
jasonwryan over 8 years

I'm not sure how this satisfies the "easiest tools" criteria; just the amount of typing alone...
jbrahy over 8 years

I prefer php to awk and sed and this is the only answer that actually does what the original post requested by removing " and ; from only the fifth column. Give me that point back.
jasonwryan over 8 years

I wasn't the downvoter, and no, my edited answer also only operates on the fifth field (and has other advantages besides brevity)...
jbrahy over 8 years

ah, ok. I didn't see the edited version. $5 is definitely less typing. For me PHP code is easier so I provided a solution I thought would help someone.
jasonwryan over 8 years

Fair enough, it is always good to see solutions using different approaches...
Wildcard over 8 years

Why not just use a character class? /[;"]/ is a lot more readable and simpler in my opinion than /\"|\;/.
jasonwryan over 8 years

@Wildcard didn't think of it, but you are right, a bracket expression would be a better/more legible solution...