Keep only a certain part of a string in a certain column
Solution 1
Using sed and column:
$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
The key part of this is the substitute command:
s/ intron_([^:]*):[^[:space:]]*/ \1/
It looks for intron_ preceded by a space and saves everything after intron_ and before the first colon in capture group 1, which the replacement refers to as \1. [^[:space:]]* matches everything from that colon to the end of the field. All of that gets replaced by a space followed by the text saved in group 1.
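To see the capture group in isolation, the substitution can be run on a single row. This is only a minimal sketch: the row is taken from the question, it is assumed to be separated by single spaces (which the leading space in the pattern requires), and the variable names line and result are illustrative.

```shell
# One data row from the question, assumed to use single spaces between fields.
line='1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268'

# ([^:]*) captures the text between "intron_" and the first colon as \1;
# :[^[:space:]]* consumes the rest of the field so that it is dropped.
result=$(printf '%s\n' "$line" | sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/')

printf '%s\n' "$result"
# prints: 1 FBgn0000721 1136 243.944268
```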
Using awk with tab-separated output:
$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
Explanation:
-v "OFS=\t"
This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.
$2=$2
When printing a line, awk won't switch to our newly specified output field separator unless we change something on the line. Assigning the second field to itself is sufficient to ensure that the output will have tabs.
sub(/intron_/, "", $2)
This removes intron_ from the second field.
sub(/:.*/, "", $2)
This removes the first colon and everything after it from the second field.
print
This prints our new line.
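The effect of $2=$2 can be checked by comparing awk's output with and without the assignment. The sketch below assumes a space-separated sample row; the variable names untouched and rebuilt are illustrative only.

```shell
# One data row from the question, assumed to use single spaces between fields.
line='1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268'

# No field is modified, so awk prints the record exactly as read and OFS is unused.
untouched=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{print}')

# Assigning to a field makes awk rebuild the record, joining the fields with OFS.
rebuilt=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{$2=$2; print}')

printf '%s\n' "$untouched"   # still separated by spaces
printf '%s\n' "$rebuilt"     # now separated by tabs
```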
Using awk with custom column formatting
This is like the above but uses printf so that we can custom-format column widths and alignments as desired:
$ awk '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
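Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style: the number is the minimum field width, and a leading - left-aligns the value instead of right-aligning it.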
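As a small illustration of those format specifiers, the same printf statement can be applied to one already-cleaned row. This is a sketch with an assumed sample row; the variable name formatted is illustrative.

```shell
# %-3s and %-12s left-align in 3 and 12 columns; %8s right-aligns in 8 columns.
# %3s is narrower than the last value: widths are minimums, so nothing is truncated.
formatted=$(printf '1 FBgn0000721 1136 243.944268\n' |
  awk '{printf "%-3s %-12s %8s %3s\n", $1, $2, $3, $4}')

# Brackets make the padding visible.
printf '[%s]\n' "$formatted"
```

Because the widths are fixed in the format string, they need to be chosen to fit the longest value expected in each column.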
Using sed and converting from tab-separated to comma-separated:
$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0
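The second substitution in that command is what produces the commas: [[:space:]][[:space:]]* matches one or more whitespace characters, so each run of spaces or tabs becomes a single comma. A minimal sketch on an assumed sample line (the mixed separators are made up for the demonstration):

```shell
# Fields separated by a run of three spaces and by a tab (assumed sample input).
line=$(printf 'id   target_id\tlength')

# Each run of whitespace, whatever its length, is replaced by one comma.
csv=$(printf '%s\n' "$line" | sed -E 's/[[:space:]][[:space:]]*/,/g')

printf '%s\n' "$csv"
# prints: id,target_id,length
```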
Solution 2
You can use perl:
$ perl -anle '
BEGIN {$" = "\t"}
print "@{[@F]}" and next if $. == 1;
$F[1] = $1 if /_([^:]*):/;
print "@{[@F]}";
' file
id target_id length eff_length
1 FBgn0000721 1136 243.944268
1 FBgn0000721 1122 240.237419
2 FBgn0264373 56 0
3 FBgn0027570 54 0
Explanation
-a: auto-split each line into the array @F.
BEGIN {$" = "\t"}: we set the list separator to a tab (\t); it is used when an array or array slice is interpolated in a double-quoted string.
print "@{[@F]}" and next if $. == 1: we print the header and proceed to the next line.
$F[1] = $1 if /_([^:]*):/: we get the value between _ and the first :, and save it to the second element of @F.
print "@{[@F]}": just print the desired output.
Karli
Updated on September 18, 2022
Comments
-
Karli over 1 year
I have a file like this:
id target_id length eff_length
1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268
1 intron_FBgn0000721:19_FBgn0000721:18 1122 240.237419
2 intron_FBgn0264373:2_FBgn0264373:3 56 0
3 intron_FBgn0027570:4_FBgn0027570:3 54 0
For the 2nd column, target_id, I want to keep only the string (not always FBgnXXXX, sometimes other names) between intron_ and the first :. So the new output file will have the simpler value for column 2, but the rest of the file remains the same.
I tried the sed command but don't know how to delete the part I don't need.
-
Karli over 9 years
Thanks a lot everyone! One more question: how can I match the whole string "intron_XXXX:XX_XXXX:XX" and replace it with something I define? I think the command would be sed 's/(some pattern matching)/(something I want to replace)/g' file. I tried several ways to match the whole pattern, but none has worked yet.
-
juanchopanza over 9 years
You might as well remove the first version; it distracts from the other three.
-
John1024 over 9 years
@juanchopanza I agree: answer updated.
-
juanchopanza over 9 years
Actually, I can't reproduce your sed output with BSD sed. Are you using GNU sed?
-
John1024 over 9 years
@juanchopanza Yes, I am. Sometimes, BSD sed has issues with +. So I replaced it in the code above with *. Let me know if that works better.
-
juanchopanza over 9 years
No, it seems the group is matching FBgn000072120_FBgn0000721:18.
-
John1024 over 9 years
Curious, I don't see how the group could extend beyond the first colon. It also occurred to me that \S might be GNU-specific. So, I replaced it with [^[:space:]].
-
cuonglm over 9 years
@AvinashRaj: No, it's the list separator. See: perldoc.perl.org/perlvar.html#%24LIST_SEPARATOR
-
juanchopanza over 9 years
The POSIX variant worked fine.
-
Karli over 9 years
I am also wondering how I can match the whole string "intron_XXXX:XX_XXXX:XX" and replace it with something I define. I think the command would be $ sed 's/(some pattern matching)/(something I want to replace)/g' file
-
John1024 over 9 years
@Karli Yes.
s/intron_[^[:space:]]*/something new/
-
Karli over 9 years
@John1024 This worked. So this means search from intron_ and anything after it in the field? So the ^ sign should be after intron_ (I thought it needed to be in front of it). And I am not too clear about the usage of [^[:space:]]; I thought it just meant space. Sorry, I am asking really basic questions, and thanks for all the explanations!
-
John1024 over 9 years
@Karli "So this means search from intron_ and anything after it in the field?" Yes. [^[:space:]] means anything except white space. (In this context, the caret ^ means "not.") So, intron_[^[:space:]]* means intron_ and any and all characters following it, up to but not including the first white space.
-
Karli over 9 years
@John1024 This file is actually a csv file. In this case, do I need to change the "space" to ",", since each column is separated by ","?
-
John1024 over 9 years
@Karli Your input appears to be tab-separated values (tsv). I have added to the end of the answer a case that converts to comma-separated value output (csv). Depending on what you are using this for, such a conversion may be unnecessary. Many programs that accept csv also accept tsv, sometimes after adjusting an option. Since your input is tsv, I suspect that may be the case for your application.
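The whole-field replacement suggested in the comments above can be tried on a single row. This is a minimal sketch, assuming a whitespace-separated sample row from the question; "something new" is the placeholder replacement text used in the comment, and the variable names are illustrative.

```shell
# One data row from the question, assumed to use single spaces between fields.
line='1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268'

# intron_ plus every following non-whitespace character, i.e. the rest of the
# field, is replaced; the match stops at the first whitespace character.
replaced=$(printf '%s\n' "$line" | sed 's/intron_[^[:space:]]*/something new/')

printf '%s\n' "$replaced"
# prints: 1 something new 1136 243.944268
```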