Replace string in a huge (70GB), one line, text file
Solution 1
The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).
If there's an ASCII character that appears frequently in the file and doesn't appear in <unk>
or <raw_unk>
, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr
processes bytes, not lines, so it doesn't care about any record size. Supposing that ;
works:
<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new
You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>
, change the sed command to sed '2,$ s/…
to avoid a spurious match.
<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new
Alternatively, use the last character.
<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new
Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.
Solution 2
For such a big file, one possibility is Flex. Let unk.l
be:
%%
\<unk\> printf("<raw_unk>");
%%
Then compile and execute:
$ flex -o unk.c unk.l
$ cc -o unk -O2 unk.c -lfl
$ unk < corpus.txt > corpus.txt.new
Solution 3
So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.
The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error-reporting for free (python "exceptions") :).
#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.
search = b'<unk>'
replace = b'<raw_unk>'
import sys
import os
import mmap
# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer
mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
i = mem.find(search)
if i < 0:
sys.exit("Search string not found")
# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)
out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])
Solution 4
There is a replace
utility in the mariadb-server/mysql-server package. It replaces simple strings (not regular expressions) and unlike grep/sed/awk replace
does not care about \n
and \0
. Memory consumption is constant with any input file (about 400kb on my machine).
Of course you do not need to run a mysql server in order to use replace
, it is only packaged that way in Fedora. Other distros/operating systems may have it packaged separately.
Solution 5
I think the C version might perform much better:
#include <stdio.h>
#include <string.h>
#define PAT_LEN 5
int main()
{
/* note this is not a general solution. In particular the pattern
* must not have a repeated sequence at the start, so <unk> is fine
* but aardvark is not, because it starts with "a" repeated, and ababc
* is not because it starts with "ab" repeated. */
char pattern[] = "<unk>"; /* set PAT_LEN to length of this */
char replacement[] = "<raw_unk>";
int c;
int i, j;
for (i = 0; (c = getchar()) != EOF;) {
if (c == pattern[i]) {
i++;
if (i == PAT_LEN) {
printf("%s", replacement);
i = 0;
}
} else {
if (i > 0) {
for (j = 0; j < i; j++) {
putchar(pattern[j]);
}
i = 0;
}
if (c == pattern[0]) {
i = 1;
} else {
putchar(c);
}
}
}
/* TODO: fix up end of file if it ends with a part of pattern */
return 0;
}
EDIT: Modified according to suggestions from the comments. Also fixed bug with the pattern <<unk>
.
![Christos Baziotis](https://i.stack.imgur.com/U7UZy.jpg?s=256&g=1)
Christos Baziotis
Updated on September 18, 2022Comments
-
Christos Baziotis almost 2 years
I have a huge (70GB), one line, text file and I want to replace a string (token) in it. I want to replace the token
<unk>
, with another dummy token (glove issue).I tried
sed
:sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but the output file
corpus.txt.new
has zero-bytes!I also tried using perl:
perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
but I got an out of memory error.
For smaller files, both of the above commands work.
How can I replace a string is such a file? This is a related question, but none of the answers worked for me.
Edit: What about splitting the file in chunks of 10GBs (or whatever) each and applying
sed
on each one of them and then merging them withcat
? Does that make sense? Is there a more elegant solution?-
RomanPerekhrest over 6 yearsas @Gilles noted, can you detect some repeated character that could serve as a custom delimiter in your single big line?
-
ctrl-alt-delor over 6 yearsI am thinking that a tool that can only do search and replace, but not any more complex regex, would be faster. It would also not benefit from doing a line at a time, so would not choke on this file. Unfortunately I have no idea of the existence of such a tool, though it would not be hard to write. If it is a one off then substituting in newline characters as in one of the answers would probably be easiest.
-
Patrick Bucher over 6 yearsDoes your file contain anything other than ASCII? If so, all the unicode handling could be omitted and raw bytes could be processed.
-
Thomas Carlisle over 6 yearsI agree with @PatrickButcher Look at a bigger picture. Besides the immediate need to replace this text, what else is this file supposed to be used for? If it is a log of some sort, no one is going to be able to work with it effectively. If it is a data file that some app uses, then that app should hold the responsibility for maintaining the data in that file.
-
ashu over 6 years@ThomasCarlisle: GloVe is related to Word2vec.
-
Martin Bonner supports Monica over 6 yearsYour example replacement is
s/<unk>/<raw_unk>/
. If you could usernk
rather thanraw_unk
, you could edit the file in-place with a simple C program - which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the<unk>
occurs frequently, you end up rewriting most of the file anyway. -
LSpice over 6 yearsIn case you have easy access to a Mac system, Hex Fiend is designed for editing huge files.
-
Vladislavs Dovgalecs over 6 yearsYou can use
split
with-b
option defining chunk file sizes in bytes. Process each in turn usingsed
and the re-assemble. There is a risk is that<unk>
can be split in two files and won't be found... -
LSpice over 6 years@VladislavsDovgalecs, good idea! For the problem you mention, maybe a sliding window will address it (for example, checking files split on 0-, 1-, 2-, 3-, and 4-aligned boundaries quintuples the work, but prevents a possible miss); or else, more economically, just giving special processing to any chunk (and its successor) that ends with any initial substring of
<unk>
? -
Kusalananda over 6 yearsIs this an XML file? In that case, there are XML parsers that can do manipulations on XML streams...
-
-
cat over 6 yearsdoes this avoid reading the entire file into memory?
-
Patrick Bucher over 6 yearsIt reads the file character by character and never holds the entire file in the memory, just individual characters.
-
Patrick Bucher over 6 years
scanner.Split(bufio.ScanRunes)
does the magic. -
Patrick Bucher over 6 yearsAlso check
go doc bufio.MaxScanTokenSize
for the default buffer size. -
Patrick Bucher over 6 yearsI measured 0.3 seconds for 30 megabytes, so it could be done within 12 minutes.
-
MiniMax over 6 yearsI also was thinking about this method, but using the
while read -N 1000 chunk;
(the1000
picked as an example). The solution for the<unk>
, broken between the chunks, is two passes through the file: the first with the 100MB chunks and the second with the '100MB + 5 byte' chunks. But it is not optimal solution in the case of the 70GB file. -
RiaD over 6 yearsyou may print (pattern[j]) instead of (buf[j]) (they are equal at this point, so you don't need buffer
-
RiaD over 6 yearsalso code will not work for string "<<unk>" ideone.com/ncM2yy
-
jamesqf over 6 yearsIf you want to do it efficiently, doing binary read/writes and asynchronous I/O would probably help.
-
Peter Cordes over 6 years30 MB in 0.3 seconds? That's only 90 MB / second.
memcpy
speed (i.e. the memory bottleneck) is something like 12GB / second on a recent x86 CPU (e.g. Skylake). Even with stdio + system call overhead, for a 30MB file hot in disk cache, I'd expect maybe 1GB / second for an efficient implementation. Did you compile with optimization disabled, or is one-char-at-a-time I/O really that slow?getchar_unlocked
/putchar_unlocked
might help, but definitely better to read/write in chunks of maybe 128kiB (half of L2 cache size on most x86 CPUs, so you mostly hit in L2 while looping after read) -
Wildcard over 6 yearsI don't have such a file to test with, but in Awk you can specify the "Record Separator" and the "Output Record Separator". So assuming you have a decent smattering of commas in your file, it's possible you could solve this with:
awk -v RS=, -v ORS=, '{gsub(/<unk>/, "<raw_unk>"); print}'
No? -
Gilles 'SO- stop being evil' over 6 years@Wildcard Yes, that's another solution. Awk tends to be slower than sed though, that's why I don't offer it as the preferred solution for a huge file.
-
jthill over 6 years
make
has default rules for this, instead of the flex/cc you can add an%option main
as the first line of unk.l and then justmake unk
. I more-or-less reflexively use%option main 8bit fast
, and haveexport CFLAGS='-march=native -pipe -Os'
in my.bashrc
. -
Rui F Ribeiro over 6 yearsfrom top of my head, getchar and putchar is slow.
-
icarus over 6 yearsThe
fix
to the program for"<<unk>"
still doesn't work if thepattern
starts with a repeated sequence of characters (i.e. it wouldn't work if you were trying to replace aardvark with zebra and you had input of aaardvak, or you were trying to replace ababc and had input of abababc). In general you can not move forward by the number of characters you have read unless you know that there is no possibility of a match starting in the characters you have read. -
icarus over 6 yearsLike your
C
program, this doesn't work for replacing aardvark with zebra with an input of aaardvark. -
Stéphane Chazelas over 6 years@MiniMax, that second pass would not necessarily help as the first pass would have added 5 bytes for each occurrence of
<unk>
. -
Stéphane Chazelas over 6 years@roaima, yes that would be a much more involved solution. Here it's a simple approach which is only highly probable (assuming the
<unk>
occurrences are far appart, if not, use$/ = ">"
ands/<unk>\z/<raw_unk>/g
) of being correct. -
jamesqf over 6 years@undercat: If it weren't off-topic, I could show you a number of non-compiler front end applications, from solving the water-level problem to special-purpose input parsing. It's amazing what you can do with it, if you think outside the box a bit :-)
-
Rahul over 6 yearsIf My system has about 4 gb consequite memory free out of the 8 gb, does mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ) mean that it place the data in that space? Or would it be much lower (1gb?)>
-
user2948306 over 6 years@Rahul "So you don't have enough RAM, but on a 64-bit system you have enough virtual address space to map the entire file." It's paged in and out of physical ram on demand (or lack thereof). This program should work without requiring any large amount of physical RAM. 64-bit systems have much more virtual address space than the maximum physical ram. Also each running process has it's own virtual address space. This means the system as a whole running out of virtual address space isn't a thing, it's not a valid concept.
-
Patrick Bucher over 6 years@icarus: Sure, the program only works for the pattern
<unk>
. It's not a general solution to a substitution problem, it's a solution for the very specific problem at hand. -
Patrick Bucher over 6 years@icarus: As stated in the comment to the C program, the program only deals with the one specific substitution of
<unk>
. Other patterns require a different logic. -
Rahul over 6 yearsThank you. I was trying to understand from this example how the os will decide regarding memory allocation when we use this python method. I understand now, its just same a typical executable. The same range of memory, hence no limitation or optimizations. I am not familiar with memory mapping in python. Thank you for the exaplanation.
-
user2948306 over 6 years@Rahul yep! python mmap.mmap() is a fairly thin wrapper around the C function mmap(). And mmap() is the same mechanism used to run executables, and code from shared libraries.
-
beasy over 6 yearsYou can set the record separator in Perl with command line option
-0
and the octal value of a char, or inside the script it can be set with special variable$/
-
jamesqf over 6 yearsBut why would one want to avoid writing C? For this, it's no more difficult (assuming a knowedge of both languages) than Python, and perhaps a bit more compact.
-
Rahul over 6 years@jamesqf I could be wrong, but I feel it is just a personal choice. Since the performance losses would be negligible (because as he said, the function actual does call the c function), the overhead wastage is very low, since no other stuff is happening in between. C would have been better, but this solution was not aiming for optimization, just to solve the bigger and difficult 70gb issue.
-
user2948306 over 6 yearsIn general, writing in python is more compact. In this case it turned out there's a couple of details in the python version, and the C version might have been nicer to write. (Though it's not so simple if
search
can contain a NUL character. And I notice the other C version here does not support NUL characters inreplace
.). You're very welcome to derive the C version for comparison purposes. However remember that my version includes basic error reporting for the operations it performs. The C version would at least be more annoying to read IMO, when error reporting is included. -
jamesqf over 6 years@Rahul: Sure, it's personal choice, but I don't think "to avoid writing in C" is a good basis for that choice. Especially if one happens to be fluent in C. Indeed, I'd probably use C to avoid writing in Python :-)
-
Rahul over 6 years@jamesqf I respect your opinion, and also that of op. When I answered your comment, the context was my comment - regarding memory space and how a c function was wrapped. Hence my response. When I read this response of yours, I did read the answer again, and understood the context you quote. I had not responded to that. So, I withdraw. I think the conversation was between you two, and I meddled, sorry for that to both of you. And also, I dont know deep c or python, I am studying :)
-
user285259 over 6 years@Gilles : But using
awk
avoid passing the stream twice totr
. So would it be still slower ? -
Gilles 'SO- stop being evil' over 6 years@user285259 Typically not.
tr
is very fast and the pipe can even be parallelized. -
rackandboneman over 6 yearsHow do you make sure you are not breaking the pattern in edge cases where there isn't enough whitespace available?
-
alfreema over 6 yearsAs stated, for this to be robust there's a requirement that there is at least one space every X characters. You can do that analysis easy enough, with any blocksize you choose: fold -w X mailtest.txt | grep -v " " | wc -l The number it returns is the number of folded lines with potential edge cases. If it's zero, the solution is guaranteed to work.
-
yieldsfalsehood over 6 yearsWhat if
<unk>
falls on a boundary between chunks? -
JJoao over 6 years@jthill, thank you:
%option main
+make
+ optionallyCFLAGS
are a very nice trick!! Is-march=native
the default behaviour? -
JJoao over 6 years@jamesqf, I feel the same and I am curious about your point of view and message: why not writing a new question to specifically raise that topic?
-
jamesqf over 6 years@JJoao: I'm not sure how such a question could be worded. "What interesting non compiler frontend things can you do with flex?" perhaps.
-
user1730706 over 6 years@jamesqf as you said - will be hard to make that an on topic question - but I would like to see it also
-
user2948306 over 6 yearsyes, yes it is. "Don't use Hadoop - your data isn't that big". This is a very simple streaming IO problem.
-
lord AJ over 6 years@jamesqf A prof of mine at uni used flex to build a tool that recognised fabric types for a factory! How about asking something like: "flex seems like a very powerful tool but I'm unlikely to be writing any compilers/parsers - are there any other use cases for flex?"
-
MoonCheese62 over 6 years@jamesqf: I'm unaware of a standard libc function that searches arbitrary byte sequences in raw memory, i. e. disregarding any NUL termination characters. Of course we can roll our own, but that's going to be at least one of slow (with a naive algorithm) or error prone (for an efficient algorithm). Python’s
find
function on buffer-like objects is supposedly well tested and implements a sensible search algorithm like Boyer-Moore. -
Trevor Boyd Smith over 6 yearsi've never used
tr
before and only a littlesed
. if thegawk
solution is a little more easy to read/understand, maybe post thegawk
solution in the comments. -
jazzdelightsme over 6 yearsAll these bugs is why writing it in python was the better choice. Premature optimization is the root of all evil and the hallmark of an inexperienced engineer!
-
Kusalananda over 6 yearsNo.
sed
reads a line at a time into memory regardless. It will not be able to fit this line. -
Kusalananda over 6 yearsI can find no documentation that says anything other than that GNU
sed
will not do input/output buffering when using this flag. I can't see that it will read partial lines. -
botkop over 3 yearsthis awk, not sed -> upvoting
-
jena over 3 yearsSorry, but why not just using
tr
directly to solve the problem? I mean what's wrong withtr "<unk>" "<raw_unk>"
? -
Gilles 'SO- stop being evil' over 3 years@jena
tr
changes single characters. This command changes allu
tor
, alln
toa
, allk
tow
and all>
to_
. It's impossible to replace a specific multi-character string withtr
. -
jena over 3 yearsOh I see, I forgot about that, thanks.