How to remove symbols from a column using awk
Solution 1
Using gsub:
awk '{gsub(/\"|\;/,"")}1' file
chr1 134901 139379 - ENSG00000237683.5
chr1 860260 879955 + ENSG00000187634.6
chr1 861264 866445 - ENSG00000268179.1
chr1 879584 894689 - ENSG00000188976.6
chr1 895967 901095 + ENSG00000187961.9
If you want to operate only on the fifth field and preserve any quotes or semicolons in other fields:
awk '{gsub(/\"|\;/,"",$5)}1' file
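A minimal sketch of the fifth-field variant, written with the bracket expression suggested in the comments instead of the alternation. The two sample lines are an assumption, reconstructed from the output above and from the patterns used in the other answers; the variable names are purely illustrative.

```shell
# Assumed input format: five space-separated columns, the fifth quoted and ending in ';'.
input='chr1 134901 139379 - "ENSG00000237683.5";
chr1 860260 879955 + "ENSG00000187634.6";'

# Strip " and ; from the fifth field only; the trailing 1 prints every record.
result=$(printf '%s\n' "$input" | awk '{ gsub(/[";]/, "", $5) } 1')
printf '%s\n' "$result"
```

Note that assigning to $5 makes awk rebuild the record with the output field separator (a single space by default), so tab-separated input comes out space-separated unless OFS is set accordingly.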
Solution 2
If your data is formatted exactly as shown (i.e. no other " or ; in other columns that need to be preserved), then you can simply use tr to remove these characters:
tr -d '";' < input.txt > output.txt
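To illustrate the caveat, here is a small sketch in which a hypothetical line also carries a quoted value in its second column. The line is made up for illustration and is not taken from the original data.

```shell
# Hypothetical line: column 2 also contains a quote and a semicolon.
line='chr1 "note;" 139379 - "ENSG00000237683.5";'

# tr deletes every " and ; on the line, regardless of column.
result=$(printf '%s\n' "$line" | tr -d '";')
printf '%s\n' "$result"
```

The quotes and semicolon in column 2 are removed as well, which is why this approach only fits data where those characters appear in the fifth column alone.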
Solution 3
Using sed to remove all instances of " and ; (note that -i edits the file in place):
sed -i 's/[";]//g' file
To remove them from the fifth column only, sed is probably not the best option.
Solution 4
I know the original post asked for sed or awk but if you want to remove the " and ; from only the fifth column I'd use regex and php. There's probably a way to do this in AWK but I like to use the easiest tools.
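<?php
// Print each line with the quotes and the semicolon stripped from the fifth column.
foreach (file($argv[1]) as $line) {
    // Groups 1-5 capture the five columns; the quotes and the semicolon stay outside the groups.
    if (preg_match('/^(\w+)\s+(\d+)\s+(\d+)\s+([-+])\s+"(\w+\.\d+)";/', $line, $matches)) {
        array_shift($matches); // drop the full match, keeping only the captured groups
        vprintf("%s\t%s\t%s\t%s\t%s\n", $matches);
    }
}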
This would output:
$ php /tmp/preg_replace.php /tmp/data
chr1 134901 139379 - ENSG00000237683.5
chr1 860260 879955 + ENSG00000187634.6
chr1 861264 866445 - ENSG00000268179.1
chr1 879584 894689 - ENSG00000188976.6
chr1 895967 901095 + ENSG00000187961.9
Solution 5
A sed solution that makes sure we're only fiddling around with the fifth column:
sed -E 's/^(([^ ]+ +){4})"([^"]+)";$/\1\3/' infile
chr1 134901 139379 - ENSG00000237683.5
chr1 860260 879955 + ENSG00000187634.6
chr1 861264 866445 - ENSG00000268179.1
chr1 879584 894689 - ENSG00000188976.6
chr1 895967 901095 + ENSG00000187961.9
This also works without ERE (-E, or -r for some older sed), but requires a lot more backslashes. The + quantifier is ERE-only according to the POSIX spec [1] and can be replaced by {1,} (or \{1,\} for BRE).
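For comparison, a sketch of what the same substitution might look like in BRE, with every group and interval backslash-escaped and + written as \{1,\}. The sample line is an assumption reconstructed from the output shown above.

```shell
# Assumed input line: five space-separated columns, the fifth quoted and ending in ';'.
line='chr1 134901 139379 - "ENSG00000237683.5";'

# BRE version: \( \) for groups, \{n\} and \{1,\} for repetition.
result=$(printf '%s\n' "$line" | sed 's/^\(\([^ ]\{1,\} \{1,\}\)\{4\}\)"\([^"]\{1,\}\)";$/\1\3/')
printf '%s\n' "$result"
```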
In case the columns aren't space-separated, the spaces can be replaced by the [[:blank:]] POSIX character class (and [^ ] by [^[:blank:]]) to also match tabs.
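A sketch of that variant, assuming a tab-separated line with the same five columns (the sample line is made up to match the format of the output above):

```shell
# Assumed tab-separated input line; printf expands \t into real tabs.
line=$(printf 'chr1\t134901\t139379\t-\t"ENSG00000237683.5";')

# [[:blank:]] matches spaces and tabs; [^[:blank:]] matches everything else.
result=$(printf '%s\n' "$line" | sed -E 's/^(([^[:blank:]]+[[:blank:]]+){4})"([^"]+)";$/\1\3/')
printf '%s\n' "$result"
```

Because the first four columns are copied back through \1 together with their separators, the original tabs are preserved in the output.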
The regex in detail:
^            # Anchored at start of line
(            # Capture group 1 for first 4 columns
  (          # Capture group 2 for repeat count
    [^ ]+    # 1 or more non-spaces
     +       # 1 or more spaces (a literal space followed by +)
  ){4}       # 4 times "word plus spaces" (columns)
)            # End capture group 1
"            # Column 5 starts with double quote (not captured)
(            # Capture group 3 for column 5
  [^"]+      # One or more non-quote characters
)            # End capture group 3
";           # Quote and semicolon at end of column 5
$            # Anchored at end of line
[1] GNU sed, as an extension, allows \+ to be used in BRE as well.
Damian Romard
Updated on September 18, 2022
Comments
-
Damian Romard over 1 year
I have a playlist text file. I'm trying to extract a list of the artists and their songs. There are 39 line items and they appear as:
Rush - Red Sector A
Blues Traveler - Hook
This is a unicode file.
I'm trying to use the '-' as the delimiter and split the lines there:
x = open(u'list.txt')
for line in x:
    line = line.strip()
    elements = line.split('-')
    artist = elements[0]
    song = elements[1]
I get a traceback:
Traceback (most recent call last):
  File "playlist.py", line 34, in <module>
    song = line[1]
IndexError: list index out of range
It appears the delimiter is not being recognized. If I comment out "song = elements[1]" and print artists, I get the full line of text, delimiter and all. I've seen similar questions, but I can't get enough insight from their solutions to make this work. Any help would be appreciated.
-
jonrsharpe over 9 years
Are you sure you have the right dash? Try to cut and paste the precise symbol from the file you're reading.
-
Damian Romard over 9 years
I think it's not seeing a dash, but some representation of the dash in Unicode: \xe2
-
El Bert over 9 years
Using your current example it works: "Rush - Red Sector A".split("-") gives me ['Rush ', ' Red Sector A']. But with the string you had before you edited your question it wasn't working: "Jace Everett – Bad Things Yes – Owner Of A Lonely Heart".split("-") gives me ['Jace Everett \xe2\x80\x93 Bad Things Yes \xe2\x80\x93 Owner Of A Lonely Heart']. Follow @jonrsharpe's idea of using the symbol from the file directly.
-
Damian Romard over 9 years
That's what I see too. If I copy and paste the dash as @jonrsharpe suggested, I get:
File "playlist.py", line 30
SyntaxError: Non-ASCII character '\xe2' in file playlist.py on line 30, but no encoding declared
-
Admin over 8 years
@DigitalTrauma ya, but Dani_l already gave that solution.
-
-
Aphid over 9 years
-
Damian Romard over 9 years
Just a note for noobs like me, the encoding notation needs to be all the way at the top of the script. Location, location, location :)
-
Dani_l over 8 years
This would remove from all columns, not just the 5th, no?
-
System over 8 years
This is what I thought initially, but after using the code it seemed to keep all columns.
-
jasonwryan over 8 years
@Dani_l Yes, it can be refined to operate only on the fifth field, but that was not a requirement...
-
System over 8 years
Sorry, I must not have made it clear: I DO want to keep all columns. This is why it is marked as the answer.
-
jasonwryan over 8 years
@System updated to ensure it only operates on the fifth field.
-
jasonwryan over 8 years
I'm not sure how this satisfies the "easiest tools" criteria; just the amount of typing alone...
-
jbrahy over 8 years
I prefer php to awk and sed and this is the only answer that actually does what the original post requested by removing " and ; from only the fifth column. Give me that point back.
-
jasonwryan over 8 years
I wasn't the downvoter, and no, my edited answer also only operates on the fifth field (and has other advantages besides brevity)...
-
jbrahy over 8 years
Ah, ok. I didn't see the edited version. $5 is definitely less typing. For me PHP code is easier so I provided a solution I thought would help someone.
-
jasonwryan over 8 years
Fair enough, it is always good to see solutions using different approaches...
-
Wildcard over 8 years
Why not just use a character class? /[;"]/ is a lot more readable and simpler in my opinion than /\"|\;/.
-
jasonwryan over 8 years
@Wildcard didn't think of it, but you are right, a bracket expression would be a better/more legible solution...