sed: delete text between a string until first occurrence of another string

regex sed

10,731

Solution 1

POSIX regular expressions used by sed (both the "basic" and "extended" versions) do not support non-greedy matches. (Although there are some workarounds, such as using [^0-9]* in place of .*, they become unreliable if the inputs vary a lot.)

What you need can be achieved in Perl, where appending ? to a quantifier (as in .*?) makes it non-greedy:

echo "The quick brown fox jumps in 2012 and 2013" \
   | perl -pe 's/fox.*?([0-9]{4})//g'

This leaves two spaces between "brown" and "and"; to remove the extra one, add a space after the closing parenthesis in the pattern.

Solution 2

You didn't specify exactly what your requirements are. You may want a multi-step process. Pick a string that you know will not occur in your input (e.g., ####):

echo "The quick brown fox jumps over 42 lazy dogs in 2012 and 2013." \
  | sed \
        -e "s/[0-9]\{4\}/&####/" \
        -e "s/fox.*####//" \
        -e "s/####//"

(Command excessively folded for readability.) The -e "s/[0-9]\{4\}/&####/" injects #### after the first four-digit number. (Warning: this will change 65536 to 6553####6.)
-e "s/fox.*####//" affects lines that contain fox and #### -- i.e., lines which contain at least one four-digit number -- and then deletes from fox through the first four-digit number.
-e "s/####//", of course, cleans out any #### strings that are left over from lines that contain a four-digit number but not fox.
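To see what each step does, the sketch below runs the stages separately on the same sample line and prints the intermediate results. The variable names are only for illustration.

```shell
line="The quick brown fox jumps over 42 lazy dogs in 2012 and 2013."

# Stage 1 only: append the marker to the first four-digit run ("42" is too short).
stage1=$(echo "$line" | sed -e 's/[0-9]\{4\}/&####/')
echo "$stage1"
# prints: The quick brown fox jumps over 42 lazy dogs in 2012#### and 2013.

# Stages 1 and 2: delete from "fox" through the marker.
# Note the two spaces left behind between "brown" and "and".
stage2=$(echo "$line" | sed -e 's/[0-9]\{4\}/&####/' -e 's/fox.*####//')
echo "$stage2"

# The caveat from the warning: a five-digit number is split by the marker.
split=$(echo "65536" | sed -e 's/[0-9]\{4\}/&####/')
echo "$split"
# prints: 6553####6
```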

To also remove one space after the number, if there is one:

echo "The quick brown fox jumps over 42 lazy dogs in 2012 and 2013." \
  | sed \
        -e "s/[0-9]\{4\}/&####/" \
        -e "s/fox.*#### //" \
        -e "s/fox.*####//" \
        -e "s/####//"

Warning: You can add g to all the s commands, but, since this still uses .*, which is the root of your problem, it will still not handle

One fox jumps in 2012 and 2013, another fox will jump in 2014 and 2015.

the way you probably want. And, of course, you don't want to add g to "s/[0-9]\{4\}/&####/" because then it will inject #### after every four-digit number, defeating the whole point. Then the "s/fox.*####//" will end up acting just like "s/fox.*[0-9]\{4\}//" (your original command with the non-contributing characters removed); i.e., it will change

The quick brown fox jumps in 2012 and 2013.

to

The quick brown fox jumps in 2012#### and 2013####.

and then to

The quick brown .

Solution 3

Assuming you want to use only sed and you want the end of the match to be the first group of digits, without caring what the word is after the digits, this works:

echo "The quick brown fox jumps in 2012 and 2013" \
   | sed "s/fox[^0-9][^0-9]*[0-9][0-9]* //"

The pattern works by matching fox, followed by one or more non-digits [^0-9][^0-9]*, followed by 1 or more digits [0-9][0-9]*. This pattern will work with an arbitrary number of digits, not just 4. If you want to match exactly 4 digits, change it to:

echo "The quick brown fox jumps in 2012 and 2013" \
   | sed "s/fox[^0-9]*\([0-9]\{4\}\) //"
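Both patterns rely on there being no other digits between fox and the number you care about. The sketch below runs them on the longer sample line from Solution 2 to show what happens when that assumption fails (the parentheses are omitted here because the group is never referenced).

```shell
line="The quick brown fox jumps over 42 lazy dogs in 2012 and 2013."

# Any-length digit run: the match ends at "42", not at the year.
any_digits=$(echo "$line" | sed 's/fox[^0-9][^0-9]*[0-9][0-9]* //')
echo "$any_digits"
# prints: The quick brown lazy dogs in 2012 and 2013.

# Four-digit variant: [^0-9]* cannot step over "42", so nothing matches
# and the line comes back unchanged.
four_digits=$(echo "$line" | sed 's/fox[^0-9]*[0-9]\{4\} //')
echo "$four_digits"
```

On input like this, the marker approach from Solution 2 or the Perl non-greedy match is the safer choice.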


Marit Hoen

Updated on November 22, 2022

Comments

Marit Hoen over 1 year
Imagine I have something like the following text:

The quick brown fox jumps in 2012 and 2013

I would like to delete the part from "fox" through the four-digit number, but only up to the first occurrence of such a number, so that I end up with:

The quick brown and 2013

Something like this...:
```
echo "The quick brown fox jumps in 2012 and 2013" \
   | sed  "s/fox.*\([0-9]\{4\}\)//g"
```
...brings me:
```
The quick brown
```
So it removed everything up to and including the last occurrence of a four-digit number.

Any ideas?
- kinokijuf over 11 years
  
  The standard quantifiers in regular expressions are greedy, meaning they match as much as they can.
user1686 over 11 years

Damn it, outninja'd. What is that about the extra space, though? (+1)
choroba over 11 years

@grawity: Try adding a space after the right parenthesis.
Scott - Слава Україні over 11 years

Are the parentheses useful?
choroba over 11 years

@Scott: Not really in this case :-)