Replace string containing newline in huge file

text-processing sed newlines

13,630

Solution 1

This really is trivial in Perl; you shouldn't hate it!

perl -i.bak -pe 's/>\n/>/' file

Explanation

-i : edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
-pe : read the input file line by line and print each line after applying the script given as -e.
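s/>\n/>/ : the substitution, written just as in sed. It works here because Perl keeps the trailing newline on each line it reads, whereas sed strips it before running the script.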

And here's an awk approach:

awk  '{if(/>$/){printf "%s",$0}else{print}}' file2
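The same awk logic can be spelled out with one rule per case, which may be easier to read. This is only a sketch; the function name `join_after_gt` and the sample input are made up for illustration:

```shell
# Same logic as the awk one-liner, wrapped in a hypothetical helper.
join_after_gt() {
  awk '
    />$/ { printf "%s", $0; next }  # line ends in ">": print it without a newline
         { print }                  # any other line: print it unchanged
  '
}

printf 'a>\nb>\nc\nd\n' | join_after_gt
```

Like the Perl version, it only ever holds the current line in memory.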

Solution 2

A perl solution:

$ perl -pe 's/(?<=>)\n//'

Explanation

s/// performs a string substitution.
(?<=>) is a lookbehind: it requires a > immediately before the match without making it part of the match.
\n matches a newline.

Taken together, the pattern removes every newline that is preceded by >.
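To see that only newlines directly after a > are touched, here is a small sketch on made-up input that has a > at the end of one line and in the middle of another. It also runs the \K form suggested in the comments; \K assumes a reasonably recent Perl (5.10 or later):

```shell
# Made-up sample: ">" at the end of a line, and ">" in the middle of one.
sample() { printf 'x>\ny>z\nw\n'; }

lookbehind=$(sample | perl -pe 's/(?<=>)\n//')  # newline preceded by ">"
keep=$(sample | perl -pe 's/>\K\n//')           # \K keeps the ">" out of the replaced text

printf '%s\n' "$lookbehind"
[ "$lookbehind" = "$keep" ] && echo "both forms agree"
```

The > inside y>z is left alone because it is not followed by a newline.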

Solution 3

How about this:

sed ':loop
  />$/ { N
    s/\n//
    b loop
  }' file

For GNU sed, you can also try adding the -u (--unbuffered) option as per the question. GNU sed is also happy with this as a simple one-liner:

sed ':loop />$/ { N; s/\n//; b loop }' file
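The loop matters when several consecutive lines end in >. A rough sketch on made-up input, comparing the looping script with a variant that has no loop (the function names are hypothetical):

```shell
# Made-up sample with two consecutive lines ending in ">".
sample() { printf 'one\ntwo>\nthree>\nfour\nfive\n'; }

# With the loop: keep appending while the pattern space still ends in ">".
join_loop() {
  sed ':loop
    />$/ { N
      s/\n//
      b loop
    }'
}

# Without the loop: each line ending in ">" absorbs only one following line.
join_once() { sed -e '/>$/!b' -e N -e 's/\n//'; }

sample | join_loop   # "two>", "three>" and "four" end up on one line
sample | join_once   # "three>" and "four" stay on separate lines
```

Either way, the pattern space only grows to the length of one joined run of lines, not the whole file.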

Solution 4

You should be able to use sed with the N command. The trick is to delete one line from the pattern space each time you add another, so that the pattern space always holds only two consecutive lines instead of the whole file. Try:

sed ':a;$!N;s/>\n/>/;P;D;ba'

EDIT: after re-reading Peteris Krumins' Famous Sed One-Liners Explained I believe a better sed solution would be

sed -e :a -e '/>$/N; s/\n//; ta'

which appends the following line only when the current line already ends in >, and loops back conditionally to handle consecutive matching lines. It is Krumins' one-liner 39, "Append a line to the next if it ends with a backslash "\"", except that > replaces \ as the join character and the join character is retained in the output.
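A quick way to try the conditional-loop version is to feed it a made-up sample containing both a run of consecutive > lines and a single one (the names `sample` and `join_ta` are only for illustration):

```shell
# Made-up sample: a run of two ">" lines, then a single ">" line.
sample() { printf 'h1>\nh2>\nbody\nh3>\ntail\n'; }

# N appends the next line only when the pattern space ends in ">";
# ta jumps back to :a only if the s command actually removed a newline.
join_ta() { sed -e :a -e '/>$/N; s/\n//; ta'; }

sample | join_ta
```

On this input the output is two lines: h1>h2>body and h3>tail.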

Solution 5

What about using ed?

ed -s test.txt <<< $'/fruits/s/apple/banana/g\nw'

(via http://wiki.bash-hackers.org/howto/edit-ed)


MattBianco

Updated on September 18, 2022

Comments

MattBianco over 1 year
Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.

I have a +2GB text file that I would like to process similar to what this appears to do:
```
sed -e 's/>\n/>/g'
```
That means, I want to remove all newlines that occur after a >, but not anywhere else, so that rules out tr -d.

This command (that I got from the answer of a similar question) fails with "couldn't re-allocate memory":
```
sed --unbuffered ':a;N;$!ba;s/>\n/>/g'
```
So, are there any other methods without resorting to C? I hate perl, but am willing to make an exception in this case :-)

I don't know for sure of any character that does not occur in the data, so temporarily replacing \n with another character is something I'd like to avoid if possible.

Any good ideas, anyone?
- ctrl-alt-delor almost 10 years
  
  Have you tried option --unbuffered?
- MattBianco almost 10 years
  
  With or without --unbuffered runs out of memory
- ctrl-alt-delor almost 10 years
  
  What does $! do?
- ctrl-alt-delor almost 10 years
  
  What is wrong with the first sed command? The second seems to read everything into the pattern space; I don't know what the $! is, though. I expect this will need a LOT of memory.
- MattBianco almost 10 years
  
  The problem is that sed reads everything as lines, that's why the first command doesn't remove the newlines, since it outputs the text row-by-row again. The second command is just a workaround. I think sed is not the proper tool in this case.
- mikeserv almost 10 years
  
  sed is the perfect tool for this case - but $! loops back to branch :a until it reaches the last line. Look at steeldriver's answer - his keeps 2 lines in memory as opposed to 2gbs.
- Graeme almost 10 years
  
  @MattBianco, if you are looking for a different solution, you would be better off asking a separate question.
- MattBianco almost 10 years
  
  I ended up using gsar like this.
MattBianco almost 10 years

Care to comment on what the parts of the program do? I'm always looking to learn.
Stéphane Chazelas almost 10 years

That doesn't remove the last \n if the file ends in >\n, but that's probably preferable anyway.
Stéphane Chazelas almost 10 years

That doesn't work if 2 consecutive lines end in > (that's also GNU specific)
Graeme almost 10 years

@StéphaneChazelas, why does the closing } need to be in a separate expression? will this not work as a multiline expression?
Stéphane Chazelas almost 10 years

That will work in POSIX seds with b loop\n} or -e 'b loop' -e '}' but not as b loop;} and certainly not as b loop} because } and ; are valid in label names (though nobody in their right mind would use it. And that means GNU sed is not POSIX conformant) and the } command needs to be separated from the b command.
Angel Todorov almost 10 years

or s/>\K\n// would also work
Angel Todorov almost 10 years

+1. awk golf: awk '{ORS=/>$/?"":"\n"}1'
MattBianco almost 10 years

Why I dislike perl in general is the same reason why I chose this answer (or actually your comment to Gnouc's answer): readability. Using perl -pe with a simple "sed pattern" is way more readable than a complex sed-expression.
cuonglm almost 10 years

@terdon: Just the first thing I thought of: remove instead of replace.
cuonglm almost 10 years

@glennjackman: good point!
Graeme almost 10 years

@StéphaneChazelas, GNU sed is happy with all of the above even with --posix! The standard also has the following for brace expressions - The list of sed functions shall be surrounded by braces and separated by <newline>s. Does this not mean that semicolons should only be used outside of braces?
terdon almost 10 years

@MattBianco fair enough but, just so you know, that has nothing to do with Perl. The lookbehind that Gnouc used is a feature of some regular expression languages (including, but not limited to, PCREs), not Perl's fault at all. Also, after featuring this sed monstrosity ':a;N;$!ba;s/>\n/>/g' in your question, you've waived your right to complain about readability! :P
terdon almost 10 years

@glennjackman nice! I was playing with the foo ? bar : baz construct but couldn't get it to work.
cuonglm almost 10 years

@MattBianco: Sorry, I had some work to do while writing my answer. I have updated it.
cuonglm almost 10 years

@terdon: Yeap, my mistake. Delete it.
Graeme almost 10 years

@mikeserv, the loop is needed to handle consecutive lines ending in >. The original never had one, this was pointed out by Stéphane.
MattBianco almost 10 years

@terdon I never claimed to understand the sed monstrosity I put in the question. What I wanted was the first sed expression, which works fine with perl. However, even perl seems to run out of memory sometimes. Does it not work on "streams"? Very strange. I just got Out of memory! :-( This is when invoked in a pipe, without -i
terdon almost 10 years

@MattBianco huh, that is strange. The out of memory is probably due to however your system is buffering the pipe though. The perl command reads line by line so there should be no memory issues there.
Graeme almost 10 years

@mikeserv, that's not the problem. The problem is that when you do the N, the next line is removed from the input, so the only way to catch consecutive lines ending in > is to apply the regex to the pattern buffer again. Try it and see: echo -e 'one\ntwo>\nthree>\nfour\nfive' | sed '/>$/!b;N;s/\n//'.
MattBianco almost 10 years

@terdon well.. I continued building my script, and replaced all other newlines with a string as a next step. It was when dealing with that file I ran out of memory, since the line was then very very long. Are there no simple search-and-replace string tools from the unix days that are not line-oriented?
terdon almost 10 years

@MattBianco not that I know of (but there may be regardless). However, I really don't see how this perl snippet could possibly run out of memory since it never holds more than a single line in memory. I'm guessing it's your shell that is running out because of the way the pipe is being buffered. You might want to post a question explaining your entire workflow so we can help you with your final objective rather than each small step.
mikeserv almost 10 years

I know - that's why I deleted it. It will work if you swap it around though and use hold space: sed H;s/.*//;x;/>$/{s/\n//;h;d} - but the hold space would require you to clean it. You're better off that way.
Graeme almost 10 years

@mikeserv, I'm not sure how that one is working, but it doesn't catch the last in the sequence of lines to be joined. Also, I am getting blank lines inserted - echo -e 'one>\ntwo>\nthree\nfour\nfive>\nsix' | sed 'H;s/.*//;x;/>$/{s/\n//;h;d}'
mikeserv almost 10 years

Monstrous is right - it was 2.5gbs!
mikeserv almost 10 years

Both methods are already demonstrated here to better effect in other answers. And his approach with sed does not work without a 2.5gigabyte buffer.
Gilles 'SO- stop being evil' almost 10 years

Did anybody mention awk? Oh, I missed it, I'd only noticed perl in terdon's answer for some reason. Nobody mentioned the tr approach — mikeserv, you posted a different (valid, but less generic) approach that happens to also use tr.
MattBianco almost 10 years

I accept this answer because it was simple, readable, and did what I asked for. But I ended up using gsar which I needed for my other problem, explained in this answer.
mikeserv almost 10 years

"Valid, but less generic" sounds to me like you've just called it a working, targeted solution. I think it's hard to argue that such a thing isn't useful, which is odd because it has 0 upvotes. The biggest difference I can see between my own solution and your more generic offering is that mine specifically solves a problem, whereas yours might generally. That might make it worthwhile, and I may even reverse my vote, but there's also the pesky matter of the 7 hours between them and the recurring theme of your answers mimicking others. Can you explain this?
Scott - Слава Україні almost 10 years

I can’t get your first answer to work at all. While I admire the elegance of the second one, I believe that you need to remove the *. The way it is now, it will delete any blank lines following a line that ends with a >. … Hmm. Looking back at the question, I see that it’s a little ambiguous. The question says, “I want to remove all newlines that occur after a >, …” I interpret that to mean that >\n\n\n\n\nfoo should be changed to \n\n\n\nfoo, but I suppose foo might be the desired output.
mikeserv almost 10 years

@Scott - I tested with variations on the following: printf '>\n>\n\n>>\n>\n>>>\n>\nf\n\nff\n>\n' | tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/' - that results in >>>>>>>>>>f\n\nff\n\n for me with the first answer. I am curious what you're doing to break it, though, because I'd like to fix it. As to the second point - I don't agree that it is ambiguous. The OP does not ask to remove all > preceding a \newline, but instead to remove all \newlines following a >.
Scott - Слава Україні almost 10 years

Yes, but a valid interpretation is that, in >\n\n\n\n\n, only the first newline is after a >; all the others are following other newlines. Note that the OP’s “this is what I want, if only it worked” suggestion was sed -e 's/>\n/>/g', not sed -e 's/>\n*/>/g'.
mikeserv almost 10 years

@Scott - the suggestion did not work and never could. I don't believe that the code suggestion of someone who does not fully understand the code can be considered as valid an interpreting point as the plain language that person also uses. And besides, the output - if it actually worked - of s/>\n/>/ on >\n\n\n\n\n would still be something that s/>\n/>/ would edit.
andrej over 9 years

edited, there is no dependency on website anymore