Replace substring of characters with awk and sed

Solution 1

With GNU awk, you can do:

gawk -v start=5 -v end=8 '{
    mid = substr($0, start, end-start+1)
    print substr($0, 1, start-1) gensub(/./, "N", "g", mid) substr($0, end+1)
}' file

Or with perl:

perl -spe 'substr($_, $start-1, $end-$start+1) =~ s/./N/g' -- -start=5 -end=8 file

With both solutions, we pass the start and end values to the program as command-line options, which makes it easy to alter them from within a shell script. To make the replacement character N configurable as well, pass it as a third variable in the same way.
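
A minimal sketch of a version with a configurable replacement character, written for any POSIX awk (it uses gsub on a copy of the substring instead of gensub). The function name mask_range and the variable name char are made up for illustration and are not part of the answer above.

```shell
# mask_range START END CHAR [FILE...]
# Replace characters START..END (1-based, inclusive) of every line with CHAR.
# Hypothetical helper; assumes CHAR is an ordinary character, because
# "&" and "\" have a special meaning in a gsub replacement string.
mask_range() {
    start=$1 end=$2 char=$3
    shift 3
    awk -v start="$start" -v end="$end" -v char="$char" '{
        # Copy the part of the line that should be masked
        mid = substr($0, start, end - start + 1)
        # Turn every character of that copy into the replacement character
        gsub(/./, char, mid)
        # Reassemble: text before the range, masked range, text after it
        print substr($0, 1, start - 1) mid substr($0, end + 1)
    }' "$@"
}
```

For example, printf 'ABCDABCDABCD\n' | mask_range 5 8 X prints ABCDXXXXABCD.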

Solution 2

If you have GNU awk (gawk), you can set FIELDWIDTHS to split the line into fields based on character positions. This is particularly convenient for your case in gawk 4.2 and later, which supports a "wildcard" (*) trailing field width. You can then replace the characters in the second field using gsub; setting OFS to the empty string rejoins the fields without separators:

echo ABCDABCDABCD | gawk -v i=5 -v n=4 '
  BEGIN {FIELDWIDTHS = sprintf("%d %d *", i-1, n)} 
  {gsub(/./,"N",$2)} 1
' OFS=""
ABCDNNNNABCD

In older versions of gawk, you can simulate the * by choosing a suitably large maximum size for the trailing field:

echo ABCDABCDABCD | gawk -v i=5 -v n=4 '
  BEGIN {FIELDWIDTHS = sprintf("%d %d 65536", i-1, n)} 
  {gsub(/./,"N",$2)} 1
' OFS=""
ABCDNNNNABCD

See the gawk manual sections "Processing Fixed-Width Data" and "Capturing Optional Trailing Data".

Solution 3

Using sed

To replace characters 5 through 8 with N:

$ sed -E 's/(.{4}).{4}/\1NNNN/' test
ABCDNNNNABCD

How it works:

  • (.{4}) captures the first four characters in group 1.

  • .{4} matches the next four characters.

  • \1NNNN replaces the matched text with group 1 followed by four Ns.
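
Since the question asks for the start index and the length to be adjustable, here is a rough sketch of the same sed command built from two shell variables. The function name mask_sed is hypothetical. Interval counts such as {4} have an implementation-defined upper bound (POSIX only guarantees 255), so for offsets in the thousands, check your sed or prefer the awk or perl approaches.

```shell
# mask_sed START LEN [FILE...]
# Replace LEN characters starting at position START (1-based) with N.
# Hypothetical helper; lines shorter than START+LEN-1 are left unchanged
# because the pattern does not match them.
mask_sed() {
    start=$1 len=$2
    shift 2
    # Build a run of LEN "N" characters: pad an empty string to LEN
    # spaces, then turn the spaces into N
    fill=$(printf "%${len}s" '' | tr ' ' 'N')
    # Group 1 keeps the first START-1 characters; the next LEN are replaced
    sed -E "s/^(.{$((start - 1))}).{$len}/\1$fill/" "$@"
}
```

For example, printf 'ABCDABCDABCD\n' | mask_sed 5 4 prints ABCDNNNNABCD.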

Using GNU awk

$ gawk -F "" '{for (i=5; i<=8; i++) $i="N"} 1' OFS="" test
ABCDNNNNABCD

How it works:

  • -F "" tells awk to treat each character as a separate field.

  • for (i=5; i<=8; i++) $i="N" loops over each character from 5 through 8 and changes it to N.

  • OFS="" joins the fields back together without separators when the modified line is rebuilt.

  • 1 tells awk to print the line.
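
The loop bounds can be taken from variables instead of being hard-coded. The following is only a sketch: the function name mask_fields is invented here, and it relies on an empty FS splitting the line into one field per character, which gawk documents but POSIX leaves unspecified.

```shell
# mask_fields START LEN [FILE...]
# Replace LEN characters starting at position START (1-based) with N,
# treating every character as its own field.
# Hypothetical helper; assumes an awk (such as gawk) in which an empty FS
# splits the record into single characters.
mask_fields() {
    start=$1 len=$2
    shift 2
    awk -v start="$start" -v len="$len" '
    BEGIN { FS = ""; OFS = "" }      # one character per field, no separator on output
    {
        last = start + len - 1
        if (last > NF) last = NF     # do not grow lines that are too short
        for (i = start; i <= last; i++) $i = "N"
        print
    }' "$@"
}
```

For example, printf 'ABCDABCDABCD\n' | mask_fields 5 4 prints ABCDNNNNABCD.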

Author: Paolo Lorenzini

Applying Data Science to genetics of human populations.

Updated on September 18, 2022

Comments

  • Paolo Lorenzini, over 1 year ago

    I have a file which contains a very long string of characters and I would like to replace a substring of it with Ns. Example:

    test

    ABCDABCDABCD
    

    Using awk or sed, I would like to replace all the characters from index 5 to 8 with the letter N, so that the run of Ns has a total length of 4.

    Output

    ABCDNNNNABCD
    

    I tried something like this:

    awk '{ v=substr($0,5,4); sed -i "s/$v/N/g";print substr($0,1,4)""v""substr($0,9,12)}' test
    

    However, this command seems to give this output:

    ABCDABCDABC
    

    and no substitution was made.

    I would like the code to take the index at which the substitution starts (here 5) and the length of the substitution (here 4), so that I can change these numbers to start at another position or to replace a different number of characters. In reality, I have a string of thousands of letters and want to replace hundreds of characters, so substituting by pattern does not work in my case.

    • Angel Todorov, almost 5 years ago
      Awk is not like shell: you can't just put a sed call in there.