Replace string containing newline in huge file
Solution 1
This really is trivial in Perl, you shouldn't hate it!
perl -i.bak -pe 's/>\n/>/' file
Explanation
-i: edit the file in place, and create a backup of the original called file.bak. If you don't want a backup, just use perl -i -pe instead.
-pe: read the input file line by line and print each line after applying the script given as -e.
s/>\n/>/: the substitution, just like sed.
And here's an awk approach:
awk '{if(/>$/){printf "%s",$0}else{print}}' file2
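A quick way to check the awk version is to pipe a few sample lines through it. This is only a sketch: the function name join_awk and the sample data are assumptions for the demo.

```shell
# Hypothetical wrapper around the awk program shown above.
# Lines ending in '>' are printed without a trailing newline;
# every other line is printed normally.
join_awk() {
    awk '{if(/>$/){printf "%s",$0}else{print}}'
}

# Made-up sample input with two consecutive lines ending in '>'.
result=$(printf '<a>\n<b>\ntext\nplain\n' | join_awk)
printf '%s\n' "$result"
```

Like the perl one-liner, awk processes one record at a time, so the size of the file should not matter as long as individual lines stay reasonably short.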
Solution 2
A perl solution:
$ perl -pe 's/(?<=>)\n//'
Explanation
s/// is used for string substitution.
(?<=>) is a lookbehind pattern.
\n matches a newline.
The whole pattern means: remove every newline that has > before it.
Solution 3
How about this:
sed ':loop
/>$/ { N
s/\n//
b loop
}' file
For GNU sed, you can also try adding the -u (--unbuffered) option as per the question. GNU sed is also happy with this as a simple one-liner:
sed ':loop />$/ { N; s/\n//; b loop }' file
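The loop matters when several lines in a row end in >. Here is a small sketch, using the multi-line form of the script and made-up sample input (the function name join_with_loop is an assumption):

```shell
# Hypothetical wrapper around the multi-line sed script shown above.
# While the pattern space ends in '>', append the next line (N),
# delete the embedded newline, and test again.
join_with_loop() {
    sed ':loop
/>$/ { N
s/\n//
b loop
}'
}

# Made-up sample: 'two>' and 'three>' are consecutive lines ending in '>'.
result=$(printf 'one\ntwo>\nthree>\nfour\nfive\n' | join_with_loop)
printf '%s\n' "$result"
```

The pattern space only grows for as long as a run of >-terminated lines continues, so memory use depends on the longest such run rather than on the file size.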
Solution 4
You should be able to use sed with the N command, but the trick will be to delete one line from the pattern space each time that you add another (so that the pattern space always contains only 2 consecutive lines, instead of trying to read in the whole file) - try
sed ':a;$!N;s/>\n/>/;P;D;ba'
EDIT: after re-reading Peteris Krumins' Famous Sed One-Liners Explained I believe a better sed solution would be
sed -e :a -e '/>$/N; s/\n//; ta'
which only appends the following line in the case that it's already made a > match at the end, and should conditionally loop back to handle the case of consecutive matching lines (it is Krumins' 39. Append a line to the next if it ends with a backslash "\" exactly, except for the substitution of > for \ as the join character, and the fact that the join character is retained in the output).
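A short sketch of how the conditional loop behaves on a run of matching lines follows. The sample input and the function name join_conditional are made up for the demo.

```shell
# Hypothetical wrapper around the second sed command from this answer.
# N runs only when the pattern space ends in '>'; 'ta' jumps back to :a
# only if the s/\n// just removed a newline, i.e. a line was really joined.
join_conditional() {
    sed -e :a -e '/>$/N; s/\n//; ta'
}

# Made-up sample: three consecutive lines end in '>'.
result=$(printf 'a>\nb>\nc>\nd\ne\n' | join_conditional)
printf '%s\n' "$result"
```

On lines that do not end in >, neither N nor the substitution does anything, so they pass straight through.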
Solution 5
What about using ed?
ed -s test.txt <<< $'/fruits/s/apple/banana/g\nw'
MattBianco
Updated on September 18, 2022
Comments
-
MattBianco over 1 year
Anyone know of a non-line-based tool to "binary" search/replace strings in a somewhat memory-efficient way? See this question too.
I have a +2GB text file that I would like to process similar to what this appears to do:
sed -e 's/>\n/>/g'
That means, I want to remove all newlines that occur after a >, but not anywhere else, so that rules out tr -d.
This command (that I got from the answer of a similar question) fails with couldn't re-allocate memory:
sed --unbuffered ':a;N;$!ba;s/>\n/>/g'
So, are there any other methods without resorting to C? I hate perl, but am willing to make an exception in this case :-)
I don't know for sure of any character that does not occur in the data, so temporarily replacing \n with another character is something I'd like to avoid if possible.
Any good ideas, anyone?
-
ctrl-alt-delor almost 10 years
Have you tried option --unbuffered?
-
MattBianco almost 10 years
With or without --unbuffered, it runs out of memory.
-
ctrl-alt-delor almost 10 years
What does $! do?
-
ctrl-alt-delor almost 10 years
What is wrong with the first sed command? The second seems to be reading everything into pattern space; I don't know what the $! is, though. I expect this will need a LOT of memory.
-
MattBianco almost 10 years
The problem is that sed reads everything as lines, that's why the first command doesn't remove the newlines, since it outputs the text row-by-row again. The second command is just a workaround. I think sed is not the proper tool in this case.
-
mikeserv almost 10 years
sed is the perfect tool for this case - but $! loops back to branch :a until it reaches the last line. Look at steeldriver's answer - his keeps 2 lines in memory as opposed to 2gbs.
-
Graeme almost 10 years
@MattBianco, if you are looking for a different solution, you would do better to ask a separate question.
-
MattBianco almost 10 years
I ended up using gsar like this.
-
MattBianco almost 10 years
Care to comment on what the parts of the program do? I'm always looking to learn.
-
Stéphane Chazelas almost 10 years
That doesn't remove the last \n if the file ends in >\n, but that's probably preferable anyway.
-
Stéphane Chazelas almost 10 years
That doesn't work if 2 consecutive lines end in > (that's also GNU specific)
-
Graeme almost 10 years
@StéphaneChazelas, why does the closing } need to be in a separate expression? Will this not work as a multiline expression?
-
Stéphane Chazelas almost 10 years
That will work in POSIX seds with b loop\n} or -e 'b loop' -e '}' but not as b loop;} and certainly not as b loop} because } and ; are valid in label names (though nobody in their right mind would use it. And that means GNU sed is not POSIX conformant) and the } command needs to be separated from the b command.
-
Angel Todorov almost 10 years
or s/>\K\n// would also work
-
Angel Todorov almost 10 years
+1. awk golf: awk '{ORS=/>$/?"":"\n"}1'
-
MattBianco almost 10 years
Why I dislike perl in general is the same reason why I chose this answer (or actually your comment to Gnouc's answer): readability. Using perl -pe with a simple "sed pattern" is way more readable than a complex sed-expression.
-
cuonglm almost 10 years
@terdon: Just the first thing I thought of, remove instead of replace
-
cuonglm almost 10 years
@glennjackman: good point!
-
Graeme almost 10 years
@StéphaneChazelas, GNU sed is happy with all of the above even with --posix! The standard also has the following for brace expressions - "The list of sed functions shall be surrounded by braces and separated by <newline>s." Does this not mean that semicolons should only be used outside of braces?
-
terdon almost 10 years
@MattBianco fair enough but, just so you know, that has nothing to do with Perl. The lookbehind that Gnouc used is a feature of some regular expression languages (including, but not limited to, PCREs), not Perl's fault at all. Also, after featuring this sed monstrosity ':a;N;$!ba;s/>\n/>/g' in your question, you've waived your right to complain about readability! :P
-
terdon almost 10 years
@glennjackman nice! I was playing with the foo ? bar : baz construct but couldn't get it to work.
-
cuonglm almost 10 years
@MattBianco: Sorry, I had some work to do while writing my answer. I updated it.
-
cuonglm almost 10 years
@terdon: Yep, my mistake. Deleted it.
-
Graeme almost 10 years
@mikeserv, the loop is needed to handle consecutive lines ending in >. The original never had one, this was pointed out by Stéphane.
-
MattBianco almost 10 years
@terdon I never claimed to understand the sed monstrosity I put in the question. What I wanted was the first sed expression, which works fine with perl. However, even perl seems to run out of memory sometimes. Does it not work on "streams"? Very strange. I just got Out of memory! :-( This is when invoked in a pipe, without -i.
-
terdon almost 10 years
@MattBianco huh, that is strange. The out of memory is probably due to however your system is buffering the pipe though. The perl command reads line by line so there should be no memory issues there.
-
Graeme almost 10 years
@mikeserv, that's not the problem. The problem is that when you do the N the next line is removed from the input, so the only way to catch consecutive lines ending in > is to apply the regex to the pattern buffer again. Try it and see: echo -e 'one\ntwo>\nthree>\nfour\nfive' | sed '/>$/!b;N;s/\n//'.
-
MattBianco almost 10 years
@terdon well.. I continued building my script, and replaced all other newlines with a string as a next step. It was when dealing with that file I ran out of memory, since the line was then very very long. Are there no simple search-and-replace string tools from the unix days that are not line-oriented?
-
terdon almost 10 years
@MattBianco not that I know of (but there may be regardless). However, I really don't see how this perl snippet could possibly run out of memory since it never holds more than a single line in memory. I'm guessing it's your shell that is running out because of the way the pipe is being buffered. You might want to post a question explaining your entire workflow so we can help you with your final objective rather than each small step.
-
mikeserv almost 10 years
I know - that's why I deleted it. It will work if you swap it around though and use hold space: sed H;s/.*//;x;/>$/{s/\n//;h;d} - but the hold space would require you to clean it. You're better off that way.
-
Graeme almost 10 years
@mikeserv, I'm not sure how that one is working but it doesn't catch the last in the sequence of lines to be joined. Also I am getting blank lines inserted - echo -e 'one>\ntwo>\nthree\nfour\nfive>\nsix' | sed 'H;s/.*//;x;/>$/{s/\n//;h;d}'
-
mikeserv almost 10 years
Monstrous is right - it was 2.5gbs!
-
mikeserv almost 10 years
Both methods are already demonstrated here to better effect in other answers. And his approach with sed does not work without a 2.5gigabyte buffer.
-
Gilles 'SO- stop being evil' almost 10 years
Did anybody mention awk? Oh, I missed it, I'd only noticed perl in terdon's answer for some reason. Nobody mentioned the tr approach - mikeserv, you posted a different (valid, but less generic) approach that happens to also use tr.
-
MattBianco almost 10 years
I accept this answer because it was simple, readable, and did what I asked for. But I ended up using gsar which I needed for my other problem, explained in this answer.
-
mikeserv almost 10 years
valid, but less generic sounds to me like you've just called it a working, targeted solution. I think it's hard to argue that such a thing isn't useful, which is odd because it has 0 upvotes. The biggest difference I can see between my own solution and your more generic offering is that mine specifically solves a problem, whereas yours might generally. That might make it worthwhile - and I may even reverse my vote - but there's also the pesky matter of the 7 hours between them and the recurring theme of your answers mimicking others. Can you explain this?
-
Scott - Слава Україні almost 10 years
I can’t get your first answer to work at all. While I admire the elegance of the second one, I believe that you need to remove the *. The way it is now, it will delete any blank lines following a line that ends with a >. … Hmm. Looking back at the question, I see that it’s a little ambiguous. The question says, “I want to remove all newlines that occur after a >, …” I interpret that to mean that >\n\n\n\n\nfoo should be changed to >\n\n\n\nfoo, but I suppose >foo might be the desired output.
-
mikeserv almost 10 years
@Scott - I tested with variations on the following: printf '>\n>\n\n>>\n>\n>>>\n>\nf\n\nff\n>\n' | tr '>\n' '\n>' | sed 's/^>*//;H;/./!d;x;y/\n>/>\n/' - that results in >>>>>>>>>>f\n\nff\n\n for me with the first answer. I am curious though what you're doing to break it, because I'd like to fix it. As to the second point - I don't agree that it is ambiguous. The OP does not ask to remove all > preceding a \newline, but instead to remove all \newlines following a >.
-
Scott - Слава Україні almost 10 years
Yes, but a valid interpretation is that, in >\n\n\n\n\n, only the first newline is after a >; all the others are following other newlines. Note that the OP’s “this is what I want, if only it worked” suggestion was sed -e 's/>\n/>/g', not sed -e 's/>\n*/>/g'.
-
mikeserv almost 10 years
@Scott - the suggestion did not work and never could. I don't believe that the code suggestion of someone who does not fully understand the code can be considered as valid an interpreting point as the plain language that person also uses. And besides, the output - if it actually worked - of s/>\n/>/ on >\n\n\n\n\n would still be something that s/>\n/>/ would edit.
-
andrej over 9 years
Edited; there is no dependency on a website anymore.