How to remove symbols from a column using awk

Solution 1

Using gsub:

awk '{gsub(/\"|\;/,"")}1' file
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

If you want to operate only on the fifth field and preserve any quotes or semicolons in other fields:

awk '{gsub(/\"|\;/,"",$5)}1' file 
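One side effect is worth knowing: when gsub changes a field, awk rebuilds the whole record and joins the fields with the output field separator (OFS), which is a single space by default. The following is a minimal sketch, assuming tab-separated output is wanted; the sample line is modeled on the data above, the variable names are only illustrative, and the bracket expression [";] is used in place of the alternation.

```shell
# Sample line modeled on the input; the quote and semicolon sit in field 5.
line='chr1    134901  139379  -   "ENSG00000237683.5";'

# Strip " and ; from field 5 only. Because the field is modified, awk rebuilds
# the record and joins the fields with OFS (set to a tab here).
clean=$(printf '%s\n' "$line" | awk -v OFS='\t' '{ gsub(/[";]/, "", $5) } 1')

printf '%s\n' "$clean"
```

Without -v OFS='\t', the rebuilt line would be joined by single spaces instead of the original column spacing.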

Solution 2

If your data is formatted exactly as shown (i.e. no other " or ; in other columns that need to be preserved), then you can simply use tr to remove these characters:

tr -d '";' < input.txt > output.txt
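To illustrate the caveat, here is a small sketch with a hypothetical line in which the first column is also quoted. tr treats its argument as a set of characters and deletes every occurrence, regardless of column.

```shell
# Hypothetical line: field 1 carries quotes that one might want to keep.
line='"chr1"  134901  139379  -   "ENSG00000237683.5";'

# tr -d deletes every " and every ; anywhere on the line.
stripped=$(printf '%s\n' "$line" | tr -d '";')

printf '%s\n' "$stripped"
```

The quotes around chr1 are removed as well, so this approach only fits data where no other column needs them.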

Solution 3

Using sed to remove all instances of " and ; (note that -i edits the file in place):

sed -i 's/[";]//g' file

To remove them from only the fifth column, sed is probably not the best option.

Solution 4

I know the original post asked for sed or awk, but if you want to remove the " and ; from only the fifth column, I'd use a regex in PHP. There's probably a way to do this in awk, but I like to use the tools I find easiest.

<?php

foreach(file($argv[1]) as $line){

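    $matches = array();
    // Skip lines that do not match, so vprintf never receives an empty array.
    if (!preg_match('/^(\w+)\s+(\d+)\s+(\d+)\s+([-+])\s+"(\w+\.\d+)";/', $line, $matches)) {
        continue;
    }
    array_shift($matches); // drop the full match, keeping only the five captured groups
    vprintf("%s\t%s\t%s\t%s\t%s\n", $matches);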
}

This would output:

$ php /tmp/preg_replace.php /tmp/data
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

Solution 5

A sed solution that makes sure we're only fiddling around with the fifth column:

sed -E 's/^(([^ ]+ +){4})"([^"]+)";$/\1\3/' infile
chr1    134901  139379  -   ENSG00000237683.5
chr1    860260  879955  +   ENSG00000187634.6
chr1    861264  866445  -   ENSG00000268179.1
chr1    879584  894689  -   ENSG00000188976.6
chr1    895967  901095  +   ENSG00000187961.9

This also works without ERE (-E, or -r for some older sed), but requires a lot more backslashes. The + quantifier is ERE-only according to the POSIX spec [1] and can be replaced by {1,} (or \{1,\} for BRE).
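For comparison, a sketch of the same substitution written as a BRE: groups become \( \), the repeat count becomes \{4\}, and each + becomes \{1,\}. The sample line is modeled on the input, and the variable names are only illustrative.

```shell
# Sample line modeled on the input data (space-separated columns).
line='chr1    134901  139379  -   "ENSG00000237683.5";'

# BRE form of the ERE command: no -E flag, escaped groups and intervals.
bre=$(printf '%s\n' "$line" |
  sed 's/^\(\([^ ]\{1,\} \{1,\}\)\{4\}\)"\([^"]\{1,\}\)";$/\1\3/')

printf '%s\n' "$bre"
```

The backreferences \1 and \3 keep their meaning, since the groups are numbered in the same order as in the ERE version.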

In case the columns aren't space-separated, each space in the expression can be replaced by the POSIX character class [[:blank:]] (and [^ ] by [^[:blank:]]) to also match tabs.
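A sketch of that variant, assuming tab-separated input (the sample line is built with printf so that the tabs are real; the variable names are only illustrative):

```shell
# Tab-separated sample line modeled on the input data.
line=$(printf 'chr1\t134901\t139379\t-\t"ENSG00000237683.5";')

# [[:blank:]] matches a space or a tab; [^[:blank:]] matches anything else.
tabbed=$(printf '%s\n' "$line" |
  sed -E 's/^(([^[:blank:]]+[[:blank:]]+){4})"([^"]+)";$/\1\3/')

printf '%s\n' "$tabbed"
```

Because the first four columns are captured together with their separators, the original tabs are kept as they are.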

The space-separated version of the regex in detail:

^               # Anchored at start of line
(               # Capture group 1 for first 4 columns
    (           # Capture group 2 for repeat count
        [^ ]+   # 1 or more non-spaces
         +      # 1 or more spaces
    ){4}        # 4 times "word plus spaces" (columns)
)               # End capture group 1
"               # Column 5 starts with double quote (not captured)
(               # Capture group 3 for column 5
    [^"]+       # One or more non-quote characters
)               # End capture group 3
";              # Quote and semicolon at end of column 5
$               # Anchored at end of line

[1] GNU sed, as an extension, allows \+ to be used in BRE as well.


Damian Romard
Author by

Damian Romard

Updated on September 18, 2022

Comments

  • Damian Romard
    Damian Romard over 1 year

    I have a playlist text file. I'm trying to extract a list of the artists and their songs. There are 39 line items and they appear as:

    Rush - Red Sector A
    Blues Traveler - Hook

    This is a unicode file.

    I'm trying to use the '-' as the delimiter and split the lines there:

    x = open(u'list.txt')
    
    for line in x:
    
        line = line.strip()
    
        elements = line.split('-')
        artist = elements[0]
        song = elements[1]
    

    I get a traceback:

    Traceback (most recent call last):
      File "playlist.py", line 34, in <module>
        song = line[1]
    IndexError: list index out of range
    

    It appears the delimiter is not being recognized. If I comment out "song = elements[1]" and print artists, I get the full line of text, delimiter and all. I've seen similar questions, but I can't get enough insight from their solutions to make this work. Any help would be appreciated.

    • jonrsharpe
      jonrsharpe over 9 years
      Are you sure you have the right dash? Try to cut and paste the precise symbol from the file you're reading.
    • Damian Romard
      Damian Romard over 9 years
      I think it's not seeing a dash, but some representation of the dash in Unicode: \xe2
    • El Bert
      El Bert over 9 years
      With your current example it works: "Rush - Red Sector A".split("-") gives me ['Rush ', ' Red Sector A']. But with the string you had before you edited your question it doesn't: "Jace Everett – Bad Things Yes – Owner Of A Lonely Heart".split("-") gives me ['Jace Everett \xe2\x80\x93 Bad Things Yes \xe2\x80\x93 Owner Of A Lonely Heart']. Follow @jonrsharpe's idea of using the symbol from the file directly.
    • Damian Romard
      Damian Romard over 9 years
      That's what I see too. If I copy and paste the dash as @jonrsharpe suggested, I get: File "playlist.py", line 30 SyntaxError: Non-ASCII character '\xe2' in file playlist.py on line 30, but no encoding declared
    • Admin
      Admin over 8 years
      @DigitalTrauma ya, but Dani_l already gave that solution.
  • Aphid
    Aphid over 9 years
    This has also been discussed here, and is covered in PEP 0263
  • Damian Romard
    Damian Romard over 9 years
    Just a note for noobs like me, the encoding notation needs to be all the way at the top of the script. Location, location, location :)
  • Dani_l
    Dani_l over 8 years
    This would remove from all columns, not just 5th, no?
  • System
    System over 8 years
    This is what I thought initially, but after using the code it seemed to keep all columns.
  • jasonwryan
    jasonwryan over 8 years
    @Dani_l Yes, it can be refined to operate only on the fifth field, but that was not a requirement...
  • System
    System over 8 years
    Sorry I must have not made it clear, I DO want to keep all columns. This is why it is marked as the answer.
  • jasonwryan
    jasonwryan over 8 years
    @System updated to ensure it only operates on the fifth field.
  • jasonwryan
    jasonwryan over 8 years
    I'm not sure how this satisfies the "easiest tools" criteria; just the amount of typing alone...
  • jbrahy
    jbrahy over 8 years
    I prefer php to awk and sed and this is the only answer that actually does what the original post requested by removing " and ; from only the fifth column. Give me that point back.
  • jasonwryan
    jasonwryan over 8 years
    I wasn't the downvoter, and no, my edited answer also only operates on the fifth field (and has other advantages besides brevity)...
  • jbrahy
    jbrahy over 8 years
    ah, ok. I didn't see the edited version. $5 is definitely less typing. For me PHP code is easier so I provided a solution I thought would help someone.
  • jasonwryan
    jasonwryan over 8 years
    Fair enough, it is always good to see solutions using different approaches...
  • Wildcard
    Wildcard over 8 years
    Why not just use a character class? /[;"]/ is a lot more readable and simpler in my opinion than /\"|\;/.
  • jasonwryan
    jasonwryan over 8 years
    @Wildcard didn't think of it, but you are right, a bracket expression would be a better/more legible solution...