Keep only a certain part of a string in a certain column

Solution 1

Using sed and column:

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/' file | column -t
id  target_id    length  eff_length
1   FBgn0000721  1136    243.944268
1   FBgn0000721  1122    240.237419
2   FBgn0264373  56      0

The key part of this is the substitute command:

s/ intron_([^:]*):[^[:space:]]*/ \1/

It looks for intron_ preceded by a space and saves everything after intron_ and before the first colon in capture group 1. [^[:space:]]* then matches everything from just after that colon to the end of the field. The whole match is replaced by a space followed by the text saved in group 1.
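To see what the capture group keeps, here is a minimal sketch that runs the same substitution on one row from the question. The row is typed with single spaces between fields, which is an assumption about the real file.

```shell
# One data row from the question, single spaces between fields (assumed).
line='1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268'

# \1 holds the text between "intron_" and the first colon;
# [^[:space:]]* then consumes the rest of that field.
result=$(printf '%s\n' "$line" | sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/')

printf '%s\n' "$result"
# prints: 1 FBgn0000721 1136 243.944268
```

Note that the pattern starts with a literal space, so it would not match a field that is preceded by a tab instead.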

Using awk with tab-separated output:

$ awk -v "OFS=\t" '{$2=$2;sub(/intron_/, "", $2); sub(/:.*/, "", $2); print}' file
id      target_id       length  eff_length
1       FBgn0000721     1136    243.944268
1       FBgn0000721     1122    240.237419
2       FBgn0264373     56      0

Explanation:

  • -v "OFS=\t"

    This sets the output field separator to a tab. This helps line up the columns, possibly making column unnecessary.

  • $2=$2

    When printing a line, awk only uses the newly specified output field separator if some field on that line has been modified. Assigning the second field to itself is enough to ensure that every output line, including the header, is tab-separated.

  • sub(/intron_/, "", $2)

    This removes intron_ from the second field.

  • sub(/:.*/, "", $2)

    This removes the first colon and everything after it from the second field.

  • print

    This prints our new line.
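The effect of $2=$2 described above can be checked in isolation. This is a small sketch, assuming a sample row typed with single spaces; tabs are shown as "|" only to make them visible.

```shell
# Sample row typed with single spaces between fields (an assumption).
line='1 FBgn0000721 1136 243.944268'

# No field is modified, so awk prints the record as read and OFS is ignored.
untouched=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{print}')

# $2=$2 forces awk to rebuild the record, joining the fields with OFS.
rebuilt=$(printf '%s\n' "$line" | awk -v 'OFS=\t' '{$2=$2; print}')

# Show tabs as "|" so the difference is visible.
printf '%s\n' "$untouched" "$rebuilt" | tr '\t' '|'
# prints:
# 1 FBgn0000721 1136 243.944268
# 1|FBgn0000721|1136|243.944268
```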

Using awk with custom column formatting:

This is like the above but uses printf so that we can custom-format column widths and alignments as desired:

$ awk  '{sub(/intron_/, "", $2); sub(/:.*/, "", $2); printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4}' file
id  target_id      length eff_length
1   FBgn0000721      1136 243.944268
1   FBgn0000721      1122 240.237419
2   FBgn0264373        56   0

Here the statement printf "%-3s %-12s %8s %3s\n",$1,$2,$3,$4 selects column widths and alignments in the usual printf style.
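For readers unfamiliar with printf widths, a short sketch of how the format specifiers pad a value. The "|" delimiters are added here purely for illustration and are not part of the answer's format string.

```shell
# %-3s and %-12s pad on the right (left-aligned);
# %8s pads on the left (right-aligned).
row=$(awk 'BEGIN { printf "%-3s|%-12s|%8s|\n", "2", "FBgn0264373", "56" }')

printf '%s\n' "$row"
# prints: 2  |FBgn0264373 |      56|
```

The number in each specifier is a minimum width, so a longer value such as 243.944268 under %3s is printed in full rather than truncated.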

Using sed and converting from tab-separated to comma-separated:

$ sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g' file
id,target_id,length,eff_length
1,FBgn0000721,1136,243.944268
1,FBgn0000721,1122,240.237419
2,FBgn0264373,56,0
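A minimal sketch of the two substitutions working together on one row. The row is typed with runs of spaces as it appears in the question, which is an assumption about the real separators.

```shell
# One row from the question, fields separated by runs of spaces (assumed).
line='2   intron_FBgn0264373:2_FBgn0264373:3      56      0'

# First shorten the second field, then turn each run of whitespace
# into a single comma.
csv=$(printf '%s\n' "$line" |
  sed -E 's/ intron_([^:]*):[^[:space:]]*/ \1/; s/[[:space:]][[:space:]]*/,/g')

printf '%s\n' "$csv"
# prints: 2,FBgn0264373,56,0
```

Because every run of whitespace becomes a comma, a line with leading or trailing whitespace would gain an empty first or last field.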

Solution 2

You can use perl:

$ perl -anle '
    BEGIN {$" = "\t"}
    print "@{[@F]}" and next if $. == 1;
    $F[1] = $1 if /_([^:]*):/;
    print "@{[@F]}";
' file
id  target_id   length  eff_length
1   FBgn0000721 1136    243.944268
1   FBgn0000721 1122    240.237419
2   FBgn0264373 56      0
3   FBgn0027570 54      0

Explanation

  • -a: automatically splits each line into the array @F.

  • BEGIN {$" = "\t"}: sets the list separator to a tab (\t). The list separator is used when an array or array slice is interpolated into a double-quoted string.

  • print "@{[@F]}" and next if $. == 1: prints the header and moves on to the next line.

  • $F[1] = $1 if /_([^:]*):/: takes the text between the first _ and the following : and stores it in the second element of @F.

  • print "@{[@F]}": prints the desired output.

Author: Karli

Updated on September 18, 2022

Comments

  • Karli
    Karli over 1 year

    I have a file like this:

    id  target_id                               length  eff_length
    1   intron_FBgn0000721:20_FBgn0000721:18    1136    243.944268
    1   intron_FBgn0000721:19_FBgn0000721:18    1122    240.237419
    2   intron_FBgn0264373:2_FBgn0264373:3      56      0
    3   intron_FBgn0027570:4_FBgn0027570:3      54      0
    

    For the 2nd column, target_id, I want to keep only the string (not always FBgnXXXX, sometimes other names) between intron_ and the first :. So the new output file will have the simpler value for column 2, but the rest of the file remains the same.

    I tried with sed command but don't know how to delete the part I don't need.

    • Karli
      Karli over 9 years
      Thanks a lot, everyone! One more question: how should I match the whole string "intron_XXXX:XX_XXXX:XX" and replace it with something I define? I think the command will be sed 's (some pattern matching)/(something I want to replace)/g file. I tried several ways to match the whole pattern, but none has worked yet.
  • juanchopanza
    juanchopanza over 9 years
    You might as well remove the first version, it distracts from the other three.
  • John1024
    John1024 over 9 years
    @juanchopanza I agree: answer updated.
  • juanchopanza
    juanchopanza over 9 years
    Actually, I can't reproduce your sed output with BSD sed. Are you using gnu sed?
  • John1024
    John1024 over 9 years
    @juanchopanza Yes, I am. Sometimes, BSD sed has issues with +. So I replaced it in the code above with *. Let me know if that works better.
  • juanchopanza
    juanchopanza over 9 years
    No, it seems the group is matching FBgn000072120_FBgn0000721:18.
  • John1024
    John1024 over 9 years
    Curious, I don't see how the group could extend beyond the first colon. It also occurred to me that \S might be GNU. So, I replaced it with [^[:space:]].
  • cuonglm
    cuonglm over 9 years
    @AvinashRaj: No, it's list separator. See: perldoc.perl.org/perlvar.html#%24LIST_SEPARATOR
  • juanchopanza
    juanchopanza over 9 years
    The posix variant worked fine.
  • Karli
    Karli over 9 years
    I am also wondering how should I get the whole string from "intron_XXXX:XX_XXXX:XX" to replace it with something I defined? I think the command will be $ sed 's (some pattern matching)/(something I want to replace)/g file
  • John1024
    John1024 over 9 years
    @Karli Yes. s/intron_[^[:space:]]*/something new/
  • Karli
    Karli over 9 years
    @John1024 this worked~So this means search from intron_ and anything after it in the field? so the ^ sign should be after intron_ (I thought it needs to be in front of...). And I am not too clear about the usage of [^[:space:]], I thought it just means space...Sorry I am asking really basic questions and thanks for all the explanations!
  • John1024
    John1024 over 9 years
    @Karli "So this means search from intron_ and anything after it in the field?" Yes. [^[:space:]] means anything except white space. (In this context, the caret ^ means "not.") So, intron_[^[:space:]]* means intron_ and any and all characters following it, up to but not including the first white space.
  • Karli
    Karli over 9 years
    @John1024 this file is actually a csv file. in this case, do I need to change the "space" to "," since each column is separated by ","?
  • John1024
    John1024 over 9 years
    @Karli Your input appears to be tab-separated values (tsv). I have added to the end of the answer a case that converts to comma-separated value output (csv). Depending on what you are using this for, such conversion may be unnecessary. Many programs that accept csv also, sometimes after jiggering an option, accept tsv. Since your input is tsv, I suspect that that may be the case for your application.
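The whole-field replacement that John1024 suggests in the comments can also be tried on a single row. A minimal sketch, where NEW_ID is a hypothetical replacement value and the sample row is typed with single spaces:

```shell
# One row from the question, single spaces between fields (assumed).
line='1 intron_FBgn0000721:20_FBgn0000721:18 1136 243.944268'

# intron_[^[:space:]]* matches "intron_" plus every following
# non-whitespace character, i.e. the entire second field.
replaced=$(printf '%s\n' "$line" | sed 's/intron_[^[:space:]]*/NEW_ID/')

printf '%s\n' "$replaced"
# prints: 1 NEW_ID 1136 243.944268
```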