Replace a string containing newline characters

Solution 1

Three different sed commands:

sed '$!N;s/"[^"]*"\n<[^>]*>/other characters /;P;D'

sed -e :n -e '$!N;s/"[^"]*"\n<[^>]*>/other characters /;tn'

sed -e :n -e '$!N;/"$/{$!bn' -e '};s/"[^"]*"\n<[^>]*>/other characters /g'

All three build on the same basic s///ubstitution command:

s/"[^"]*"\n<[^>]*>/other characters /

They also all try to take care in their handling of the last line, as seds tend to differ on their output in edge cases. This is the meaning of $! which is an address matching every line that is !not the $last.

They also all use the Next command to append the next input line to pattern space following a \newline character. Anyone who has been seding for a while will have learned to rely on the \newline character - because the only way to get one is to explicitly put it there.
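As a minimal sketch of those two points (the input letters and the + replacement are made up for illustration), $!N pairs each line with the one after it, and a \n on the left-hand side of s/// then matches the \newline that N inserted:

```shell
# $!N: on every line but the last, append the next input line to pattern space.
# The s command then replaces the newline that N put there with a plus sign.
joined=$(printf '%s\n' a b c | sed '$!N;s/\n/+/')
printf '%s\n' "$joined"
# prints:
# a+b
# c
```

The third line is printed untouched because it is the last one: $! keeps N from running, so there is no \newline to replace.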

All three make some attempt to read in as little input as possible before taking action - sed acts as soon as it might and needn't read in an entire input file before doing so.

Though all three use N, they differ in their methods of recursion.

First Command

The first command employs a very simple N;P;D loop. These three commands are built-in to any POSIX-compatible sed and they complement one another nicely.

  • N - as already mentioned, appends the Next input line to pattern-space following an inserted \newline delimiter.
  • P - like p; it Prints pattern-space - but only up-to the first occurring \newline character. And so, given the following input/command:

      printf %s\\n one two | sed '$!N;P;d'

    sed Prints only one. However, with...

  • D - like d; it Deletes pattern-space and begins another line-cycle. Unlike d, D deletes only up to the first occurring \newline in pattern-space. If there is more in pattern-space following \newline character, sed begins the next line cycle with what remains. If the d in the previous example were replaced with a D, for example, sed would Print both one and two.
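The difference between d and D can be checked side by side. This is only a sketch; the variable names with_d and with_D are made up for illustration:

```shell
# With d, the whole pattern space ("one\ntwo") is thrown away after P prints "one".
with_d=$(printf '%s\n' one two | sed '$!N;P;d')

# With D, only "one\n" is deleted; the next cycle starts with "two" still in
# pattern space, and P prints that as well.
with_D=$(printf '%s\n' one two | sed '$!N;P;D')

printf 'with d:\n%s\n' "$with_d"
printf 'with D:\n%s\n' "$with_D"
# with d prints: one
# with D prints: one, then two
```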

This command recurses only for lines which do not match the s///ubstitution statement. Because the s///ubstitution removes the \newline added with N, there is never anything remaining when sed Deletes pattern-space.

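Tests could be done to apply the P and/or D selectively, but there are other commands which fit better with that strategy. Because the recursion is implemented to handle consecutive lines which match only part of the replacement rule, consecutive sequences of lines matching both ends of the s///ubstitution do not work well: after a successful s///ubstitution there is no \newline left in pattern-space, so D starts the next cycle with a fresh input line and the lines end up being paired off two at a time.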

Given this input:

first "line"
<second>"line"
<second>"line"
<second>line and so on

...it prints...

first other characters "line"
<second>other characters line and so on

It does, however, handle

first "line"
second "line"
<second>line

...just fine.
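Both cases can be reproduced with a small sketch. The helper name first_cmd and the variable names are made up for illustration; the sed script itself is the first command from above:

```shell
# Hypothetical wrapper around the first command, reading from stdin.
first_cmd() {
  sed '$!N;s/"[^"]*"\n<[^>]*>/other characters /;P;D'
}

# Consecutive matching lines: only every other \newline gets replaced.
consecutive=$(printf '%s\n' 'first "line"' '<second>"line"' \
  '<second>"line"' '<second>line and so on' | first_cmd)

# A partial match followed by a full match: handled as intended.
partial=$(printf '%s\n' 'first "line"' 'second "line"' '<second>line' | first_cmd)

printf '%s\n' "$consecutive"
# first other characters "line"
# <second>other characters line and so on

printf '%s\n' "$partial"
# first "line"
# second other characters line
```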

Second Command

This command is very similar to the third. Both employ a :branch/test label (as is also demonstrated in Joeseph R.'s answer here) and recurse back to it given certain conditions.

  • -e :n -e - portable sed scripts will delimit a :label definition with either a \newline or a new inline -execution statement.
    • :n - defines a label named n. This can be returned to at any time with either bn or tn.
  • tn - the test command branches to the specified label (or, if none is provided, to the end of the script for the current line-cycle) if any s///ubstitution has succeeded since the last input line was read or since the last t command was executed.

In this command the recursion occurs for the matching lines. If sed successfully replaces the pattern with other characters, sed returns to the :n label and tries again. If a s///ubstitution is not performed sed autoprints pattern-space and begins the next line-cycle.

This tends to handle consecutive sequences better. Where the last one failed, this prints:

first other characters other characters other characters line and so on
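To try this out, the second command can be fed the same four lines. A sketch only; second_cmd and result are names made up for illustration:

```shell
# Hypothetical wrapper around the second command, reading from stdin.
second_cmd() {
  sed -e :n -e '$!N;s/"[^"]*"\n<[^>]*>/other characters /;tn'
}

# Every successful substitution sends sed back to :n, where N appends one more line.
result=$(printf '%s\n' 'first "line"' '<second>"line"' \
  '<second>"line"' '<second>line and so on' | second_cmd)

printf '%s\n' "$result"
# first other characters other characters other characters line and so on
```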

Third Command

As mentioned, the logic here is very similar to the last, but the test is more explicit.

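  • /"$/{$!bn' -e '} - this is sed's test (the closing } goes in its own -e statement so that it is portably delimited). Because the branch command is a function of this address, sed will only branch back to :n after a \newline is appended and pattern-space still ends with a " double-quote. The $! keeps sed from branching on the last line, when there is nothing left to append.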

There is as little done between N and b as possible - in this way sed can very quickly gather exactly as much input as necessary to ensure that the following line cannot match your rule. The s///ubstitution differs here in that it employs the global flag - and so it will do all necessary replacements at once. Given identical input this command outputs identically to the last.
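The same check for the third command, again only as a sketch with made-up names (third_cmd, result3):

```shell
# Hypothetical wrapper around the third command, reading from stdin.
third_cmd() {
  sed -e :n -e '$!N;/"$/{$!bn' -e '};s/"[^"]*"\n<[^>]*>/other characters /g'
}

# sed keeps appending lines while pattern space ends in a double-quote,
# then performs all replacements in one pass thanks to the g flag.
result3=$(printf '%s\n' 'first "line"' '<second>"line"' \
  '<second>"line"' '<second>line and so on' | third_cmd)

printf '%s\n' "$result3"
# first other characters other characters other characters line and so on
```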

Solution 2

Well, I can think of a couple of simple ways but neither involves grep (which doesn't do substitutions anyway) or sed.

  1. Perl

    To replace each occurrence of "line"\n<second> with other characters, use:

    $ perl -00pe 's/"line"\n<second>/other characters /g' file
    first other characters line and so on
    

    Or, to treat multiple, consecutive occurrences of "line"\n<second> as one, and replace all of them with a single other characters, use:

    perl -00pe 's/(?:"line"\n<second>)+/other characters /g' file
    

    Example:

    $ cat file
    first "line"
    <second>"line"
    <second>"line"
    <second>line and so on
    $ perl -00pe 's/(?:"line"\n<second>)+/other characters /g' file
    first other characters line and so on
    

    The -00 causes Perl to read the file in "paragraph mode", which means that "lines" are delimited by \n\n instead of \n; essentially, each paragraph is treated as a line. The substitution can therefore match across a newline.

  2. awk

    $  awk -v RS="\n\n" -v ORS="" '{
          sub(/"line"\n<second>/,"other characters ", $0)
          print;
        }' file 
    first other characters line and so on
    

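    The same basic idea: we set the record separator (RS) to \n\n so that awk reads a paragraph at a time (the whole file, if it contains no blank lines), set the output record separator (ORS) to nothing (otherwise an extra newline is printed), and then use the sub() function to make the replacement. Note that sub() replaces only the first occurrence in each record, while gsub() replaces every occurrence. A multi-character RS is also an extension supported by gawk and mawk rather than something POSIX guarantees.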
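If one replacement per occurrence is wanted, a variation on the awk approach might look like the sketch below. It assumes awk's paragraph mode (an empty RS) instead of RS="\n\n" and swaps sub() for gsub(); the helper name awk_replace is made up for illustration:

```shell
# Hypothetical helper: paragraph mode via an empty RS, gsub() for every occurrence.
awk_replace() {
  awk 'BEGIN { RS = "" } { gsub(/"line"\n<second>/, "other characters "); print }'
}

awk_out=$(printf '%s\n' 'first "line"' '<second>"line"' \
  '<second>"line"' '<second>line and so on' | awk_replace)

printf '%s\n' "$awk_out"
# first other characters other characters other characters line and so on
```

Because ORS is left at its default here, paragraphs in a longer file would come out separated by a single newline rather than a blank line, so this suits input made of one block of text.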

Solution 3

Read the whole file into hold space and do a global replacement:

sed -n 'H; ${x; s/"line"\n<second>/other characters /g; p}' <<END
first "line"
<second> line followed by "line"
<second> and last
END
first other characters  line followed by other characters  and last
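Two caveats come up in the comments: H on the first line appends to an empty hold space, so the result begins with an extra \newline, and a one-line ${...} group is not accepted by every sed. The variant suggested there avoids both. Here it is as a sketch, wrapped in a helper whose name, slurp_replace, is made up for illustration:

```shell
# Hypothetical wrapper around the variant suggested in the comments.
# H appends each line to hold space; 1h overwrites hold space on line 1 so
# there is no leading newline; $!d drops every line but the last; x then
# swaps the collected text into pattern space for the substitution.
slurp_replace() {
  sed 'H;1h;$!d;x;s/"line"\n<second>/other characters /g'
}

slurped=$(printf '%s\n' 'first "line"' '<second> line followed by "line"' \
  '<second> and last' | slurp_replace)

printf '%s\n' "$slurped"
# first other characters  line followed by other characters  and last
```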

Solution 4

Here's a variant on glenn's answer that will work if you have multiple consecutive occurrences (works with GNU sed only):

sed ':x /"line"/N;s/"line"\n<second>/other characters /;/"line"/bx' your_file

The :x is just a label for branching. After the substitution, sed checks whether pattern space still matches "line"; if it does, bx branches back to the :x label, where N appends another line to pattern space and the substitution is tried again.

Author by

BowPark

Updated on September 18, 2022

Comments

  • BowPark
    BowPark over 1 year

    With the bash shell, in a file with rows like the following ones

    first "line"
    <second>line and so on
    

    I would like to replace one or more occurrences of "line"\n<second> with other characters and obtain each time:

    first other characters line and so on
    

    So I have to replace a string that contains both special characters such as " and < and a newline character.

    After searching among the other answers, I found that sed can accept newlines in the right-hand side of the command (so, the other characters string), but not in the left.

    Is there a way (simpler than this) to obtain this result with sed or grep?

    • mikeserv
      mikeserv over 9 years
      are you working w/ a mac? the \newline statement you make is why i ask. people seldom ask if they can do s//\n/ as you can with GNU sed, though most other seds will reject that escape on the right hand side. still, the \n escape will work on the left in any POSIX sed and you can portably translate them like y/c/\n/ though it will have the same effect as s/c/\n/g and so isnt always as useful.
  • BowPark
    BowPark over 9 years
    Yes. It works, but what if I have multiple occurrences?
  • BowPark
    BowPark over 9 years
    Sorry for the trivial question, but what is the meaning of DATA and how do you receive the text input?
  • mikeserv
    mikeserv over 9 years
    @BowPark - In this example <<\DATA\ntext input\nDATA\n is baked in, but that is only text handed to sed by the shell in a here document. It would work as well like sed 'script' filename or process that writes to stdout | sed 'script'. Does that help?
  • Jeff Hewitt
    Jeff Hewitt over 9 years
    @mikeserv Please be specific about what you mean. It worked for me.
  • Jeff Hewitt
    Jeff Hewitt over 9 years
    @mikeserv I'm sorry, I really don't know what you're talking about. I copied the above code line back into my terminal and it worked correctly.
  • mikeserv
    mikeserv over 9 years
    retracted - this does apparently work in GNU sed which takes its non-POSIX label handling far enough to accept a space as a delimiter for label declaration. You should note though, that any other sed will fail there - and will fail for N. GNU sed breaks POSIX guidelines to print pattern-space before quitting on a N on the last line, but POSIX makes it clear that if an N command is read on the last line nothing should be printed.
  • mikeserv
    mikeserv over 9 years
    If you edit the post to specify GNU I will reverse my vote and delete these comments. Also, it might be worth learning about GNU's v command which breaks in every other sed but is a no-op in GNU versions 4 and greater.
  • Angel Todorov
    Angel Todorov over 9 years
    Huh, right. Fixed
  • Jeff Hewitt
    Jeff Hewitt over 9 years
    @mikeserv Thanks for the comments. I edited the post. Please don't delete your comments as they may benefit someone else.
  • mikeserv
    mikeserv over 9 years
    in that case I will offer one more - this can be done portably like: sed -e :x -e '/"line"/{$!N' -e '};s/"line"\n<second>/other characters/;/"line"/bx'.
  • mikeserv
    mikeserv over 9 years
    sorry to nitpick again, but ${cmds} is GNU-specific - most other seds will require a \newline or an -e break between p and }. You can avoid the brackets altogether - and portably - and even avoid inserting an extra \newline character on the first line like: sed 'H;1h;$!d;x;s/"line"\n<second>/other characters /g'
  • terdon
    terdon over 9 years
    @mikeserv? Which one? The second is supposed to, the OP said they want "to replace one or more occurrences of", so eating the paragraph might well be what they expect.
  • mikeserv
    mikeserv over 9 years
    very good point. I guess I focused more on and obtain each time, but I guess it is not clear if that should be one replacement per occurrence or one replacement per sequence of occurrences... @BowPark?
  • BowPark
    BowPark over 9 years
    One replacement per occurrence is needed.
  • BowPark
    BowPark over 9 years
    Yes it does, thank you! Why is every modified line doubled without D? (You used it because it is necessary; maybe I don't know sed very well)
  • terdon
    terdon over 9 years
    @BowPark OK, then the first perl approach or the awk should both work. Don't they give you the desired output?
  • BowPark
    BowPark over 9 years
    I tested it and it seems not portable. It prints an extra new-line at the beginning of the output, but the result is correct on GNU.
  • BowPark
    BowPark over 9 years
    It works, thank you, but the third line with awk should be print;}' file. I need to avoid Perl and to preferably use sed, anyway you suggested good alternatives.
  • mikeserv
    mikeserv over 9 years
    @BowPark - you get doubles when omitting the D because D otherwise Deletes from output what you now see doubled. I have just made an edit - and I may expand on that as well soon.
  • mikeserv
    mikeserv over 9 years
    @BowPark - ok, I've updated it and provided options. It might be a little easier to read/understand now. I also explicitly addressed the D thing.