Replace string in a huge (70GB), one line, text file

text-processing sed large-files

30,960

Solution 1

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.

Solution 2

For such a big file, one possibility is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");  
%%

Then compile and execute:

$ flex -o unk.c  unk.l
$ cc -o unk -O2 unk.c -lfl
$ unk < corpus.txt > corpus.txt.new

Solution 3

So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error-reporting for free (python "exceptions") :).

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

search = b'<unk>'
replace = b'<raw_unk>'


import sys
import os
import mmap

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])

Solution 4

There is a replace utility in the mariadb-server/mysql-server package. It replaces simple strings (not regular expressions) and unlike grep/sed/awk replace does not care about \n and \0. Memory consumption is constant with any input file (about 400kb on my machine).

Of course you do not need to run a mysql server in order to use replace, it is only packaged that way in Fedora. Other distros/operating systems may have it packaged separately.

Solution 5

I think the C version might perform much better:

#include <stdio.h>
#include <string.h>

#define PAT_LEN 5

int main()
{
    /* note this is not a general solution. In particular the pattern
     * must not have a repeated sequence at the start, so <unk> is fine
     * but aardvark is not, because it starts with "a" repeated, and ababc
     * is not because it starts with "ab" repeated. */
    char pattern[] = "<unk>";          /* set PAT_LEN to length of this */
    char replacement[] = "<raw_unk>"; 
    int c;
    int i, j;

    for (i = 0; (c = getchar()) != EOF;) {
        if (c == pattern[i]) {
            i++;
            if (i == PAT_LEN) {
                printf("%s", replacement);
                i = 0;
            }
        } else {
            if (i > 0) {
                for (j = 0; j < i; j++) {
                    putchar(pattern[j]);
                }
                i = 0;
            }
            if (c == pattern[0]) {
                i = 1;
            } else {
                putchar(c);
            }
        }
    }
    /* TODO: fix up end of file if it ends with a part of pattern */
    return 0;
}

EDIT: Modified according to suggestions from the comments. Also fixed bug with the pattern <<unk>.

View more solutions

30,960

Author by

Christos Baziotis

Updated on September 18, 2022

Comments

Christos Baziotis almost 2 years
I have a huge (70GB), one line, text file and I want to replace a string (token) in it. I want to replace the token <unk>, with another dummy token (glove issue).

I tried sed:
```
sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
```
but the output file corpus.txt.new has zero-bytes!

I also tried using perl:
```
perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
```
but I got an out of memory error.

For smaller files, both of the above commands work.

How can I replace a string is such a file? This is a related question, but none of the answers worked for me.

Edit: What about splitting the file in chunks of 10GBs (or whatever) each and applying sed on each one of them and then merging them with cat? Does that make sense? Is there a more elegant solution?
- RomanPerekhrest over 6 years
  
  as @Gilles noted, can you detect some repeated character that could serve as a custom delimiter in your single big line?
- ctrl-alt-delor over 6 years
  
  I am thinking that a tool that can only do search and replace, but not any more complex regex, would be faster. It would also not benefit from doing a line at a time, so would not choke on this file. Unfortunately I have no idea of the existence of such a tool, though it would not be hard to write. If it is a one off then substituting in newline characters as in one of the answers would probably be easiest.
- Patrick Bucher over 6 years
  
  Does your file contain anything other than ASCII? If so, all the unicode handling could be omitted and raw bytes could be processed.
- Thomas Carlisle over 6 years
  
  I agree with @PatrickButcher Look at a bigger picture. Besides the immediate need to replace this text, what else is this file supposed to be used for? If it is a log of some sort, no one is going to be able to work with it effectively. If it is a data file that some app uses, then that app should hold the responsibility for maintaining the data in that file.
- ashu over 6 years
  
  @ThomasCarlisle: GloVe is related to Word2vec.
- Martin Bonner supports Monica over 6 years
  
  Your example replacement is s/<unk>/<raw_unk>/. If you could use rnk rather than raw_unk, you could edit the file in-place with a simple C program - which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the <unk> occurs frequently, you end up rewriting most of the file anyway.
- LSpice over 6 years
  
  In case you have easy access to a Mac system, Hex Fiend is designed for editing huge files.
- Vladislavs Dovgalecs over 6 years
  
  You can use split with -b option defining chunk file sizes in bytes. Process each in turn using sed and the re-assemble. There is a risk is that <unk> can be split in two files and won't be found...
- LSpice over 6 years
  
  @VladislavsDovgalecs, good idea! For the problem you mention, maybe a sliding window will address it (for example, checking files split on 0-, 1-, 2-, 3-, and 4-aligned boundaries quintuples the work, but prevents a possible miss); or else, more economically, just giving special processing to any chunk (and its successor) that ends with any initial substring of <unk>?
- Kusalananda over 6 years
  
  Is this an XML file? In that case, there are XML parsers that can do manipulations on XML streams...
cat over 6 years

does this avoid reading the entire file into memory?
Patrick Bucher over 6 years

It reads the file character by character and never holds the entire file in the memory, just individual characters.
Patrick Bucher over 6 years

scanner.Split(bufio.ScanRunes) does the magic.
Patrick Bucher over 6 years

Also check go doc bufio.MaxScanTokenSize for the default buffer size.
Patrick Bucher over 6 years

I measured 0.3 seconds for 30 megabytes, so it could be done within 12 minutes.
MiniMax over 6 years

I also was thinking about this method, but using the while read -N 1000 chunk; (the 1000 picked as an example). The solution for the <unk>, broken between the chunks, is two passes through the file: the first with the 100MB chunks and the second with the '100MB + 5 byte' chunks. But it is not optimal solution in the case of the 70GB file.
RiaD over 6 years

you may print (pattern[j]) instead of (buf[j]) (they are equal at this point, so you don't need buffer
RiaD over 6 years

also code will not work for string "<<unk>" ideone.com/ncM2yy
jamesqf over 6 years

If you want to do it efficiently, doing binary read/writes and asynchronous I/O would probably help.
Peter Cordes over 6 years

30 MB in 0.3 seconds? That's only 90 MB / second. memcpy speed (i.e. the memory bottleneck) is something like 12GB / second on a recent x86 CPU (e.g. Skylake). Even with stdio + system call overhead, for a 30MB file hot in disk cache, I'd expect maybe 1GB / second for an efficient implementation. Did you compile with optimization disabled, or is one-char-at-a-time I/O really that slow? getchar_unlocked / putchar_unlocked might help, but definitely better to read/write in chunks of maybe 128kiB (half of L2 cache size on most x86 CPUs, so you mostly hit in L2 while looping after read)
Wildcard over 6 years

I don't have such a file to test with, but in Awk you can specify the "Record Separator" and the "Output Record Separator". So assuming you have a decent smattering of commas in your file, it's possible you could solve this with: awk -v RS=, -v ORS=, '{gsub(/<unk>/, "<raw_unk>"); print}' No?
Gilles 'SO- stop being evil' over 6 years

@Wildcard Yes, that's another solution. Awk tends to be slower than sed though, that's why I don't offer it as the preferred solution for a huge file.
jthill over 6 years

make has default rules for this, instead of the flex/cc you can add an %option main as the first line of unk.l and then just make unk. I more-or-less reflexively use %option main 8bit fast, and have export CFLAGS='-march=native -pipe -Os' in my .bashrc.
Rui F Ribeiro over 6 years

from top of my head, getchar and putchar is slow.
icarus over 6 years

The fix to the program for "<<unk>" still doesn't work if the pattern starts with a repeated sequence of characters (i.e. it wouldn't work if you were trying to replace aardvark with zebra and you had input of aaardvak, or you were trying to replace ababc and had input of abababc). In general you can not move forward by the number of characters you have read unless you know that there is no possibility of a match starting in the characters you have read.
icarus over 6 years

Like your C program, this doesn't work for replacing aardvark with zebra with an input of aaardvark.
Stéphane Chazelas over 6 years

@MiniMax, that second pass would not necessarily help as the first pass would have added 5 bytes for each occurrence of <unk>.
Stéphane Chazelas over 6 years

@roaima, yes that would be a much more involved solution. Here it's a simple approach which is only highly probable (assuming the <unk> occurrences are far appart, if not, use $/ = ">" and s/<unk>\z/<raw_unk>/g) of being correct.
jamesqf over 6 years

@undercat: If it weren't off-topic, I could show you a number of non-compiler front end applications, from solving the water-level problem to special-purpose input parsing. It's amazing what you can do with it, if you think outside the box a bit :-)
Rahul over 6 years

If My system has about 4 gb consequite memory free out of the 8 gb, does mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ) mean that it place the data in that space? Or would it be much lower (1gb?)>
user2948306 over 6 years

@Rahul "So you don't have enough RAM, but on a 64-bit system you have enough virtual address space to map the entire file." It's paged in and out of physical ram on demand (or lack thereof). This program should work without requiring any large amount of physical RAM. 64-bit systems have much more virtual address space than the maximum physical ram. Also each running process has it's own virtual address space. This means the system as a whole running out of virtual address space isn't a thing, it's not a valid concept.
Patrick Bucher over 6 years

@icarus: Sure, the program only works for the pattern <unk>. It's not a general solution to a substitution problem, it's a solution for the very specific problem at hand.
Patrick Bucher over 6 years

@icarus: As stated in the comment to the C program, the program only deals with the one specific substitution of <unk>. Other patterns require a different logic.
Rahul over 6 years

Thank you. I was trying to understand from this example how the os will decide regarding memory allocation when we use this python method. I understand now, its just same a typical executable. The same range of memory, hence no limitation or optimizations. I am not familiar with memory mapping in python. Thank you for the exaplanation.
user2948306 over 6 years

@Rahul yep! python mmap.mmap() is a fairly thin wrapper around the C function mmap(). And mmap() is the same mechanism used to run executables, and code from shared libraries.
beasy over 6 years

You can set the record separator in Perl with command line option -0 and the octal value of a char, or inside the script it can be set with special variable $/
jamesqf over 6 years

But why would one want to avoid writing C? For this, it's no more difficult (assuming a knowedge of both languages) than Python, and perhaps a bit more compact.
Rahul over 6 years

@jamesqf I could be wrong, but I feel it is just a personal choice. Since the performance losses would be negligible (because as he said, the function actual does call the c function), the overhead wastage is very low, since no other stuff is happening in between. C would have been better, but this solution was not aiming for optimization, just to solve the bigger and difficult 70gb issue.
user2948306 over 6 years

In general, writing in python is more compact. In this case it turned out there's a couple of details in the python version, and the C version might have been nicer to write. (Though it's not so simple if search can contain a NUL character. And I notice the other C version here does not support NUL characters in replace.). You're very welcome to derive the C version for comparison purposes. However remember that my version includes basic error reporting for the operations it performs. The C version would at least be more annoying to read IMO, when error reporting is included.
jamesqf over 6 years

@Rahul: Sure, it's personal choice, but I don't think "to avoid writing in C" is a good basis for that choice. Especially if one happens to be fluent in C. Indeed, I'd probably use C to avoid writing in Python :-)
Rahul over 6 years

@jamesqf I respect your opinion, and also that of op. When I answered your comment, the context was my comment - regarding memory space and how a c function was wrapped. Hence my response. When I read this response of yours, I did read the answer again, and understood the context you quote. I had not responded to that. So, I withdraw. I think the conversation was between you two, and I meddled, sorry for that to both of you. And also, I dont know deep c or python, I am studying :)
user285259 over 6 years

@Gilles : But using awk avoid passing the stream twice to tr. So would it be still slower ?
Gilles 'SO- stop being evil' over 6 years

@user285259 Typically not. tr is very fast and the pipe can even be parallelized.
rackandboneman over 6 years

How do you make sure you are not breaking the pattern in edge cases where there isn't enough whitespace available?
alfreema over 6 years

As stated, for this to be robust there's a requirement that there is at least one space every X characters. You can do that analysis easy enough, with any blocksize you choose: fold -w X mailtest.txt | grep -v " " | wc -l The number it returns is the number of folded lines with potential edge cases. If it's zero, the solution is guaranteed to work.
yieldsfalsehood over 6 years

What if <unk> falls on a boundary between chunks?
JJoao over 6 years

@jthill, thank you: %option main + make + optionally CFLAGS are a very nice trick!! Is -march=native the default behaviour?
JJoao over 6 years

@jamesqf, I feel the same and I am curious about your point of view and message: why not writing a new question to specifically raise that topic?
jamesqf over 6 years

@JJoao: I'm not sure how such a question could be worded. "What interesting non compiler frontend things can you do with flex?" perhaps.
user1730706 over 6 years

@jamesqf as you said - will be hard to make that an on topic question - but I would like to see it also
user2948306 over 6 years

yes, yes it is. "Don't use Hadoop - your data isn't that big". This is a very simple streaming IO problem.
lord AJ over 6 years

@jamesqf A prof of mine at uni used flex to build a tool that recognised fabric types for a factory! How about asking something like: "flex seems like a very powerful tool but I'm unlikely to be writing any compilers/parsers - are there any other use cases for flex?"
MoonCheese62 over 6 years

@jamesqf: I'm unaware of a standard libc function that searches arbitrary byte sequences in raw memory, i. e. disregarding any NUL termination characters. Of course we can roll our own, but that's going to be at least one of slow (with a naive algorithm) or error prone (for an efficient algorithm). Python’s find function on buffer-like objects is supposedly well tested and implements a sensible search algorithm like Boyer-Moore.
Trevor Boyd Smith over 6 years

i've never used tr before and only a little sed. if the gawk solution is a little more easy to read/understand, maybe post the gawk solution in the comments.
jazzdelightsme over 6 years

All these bugs is why writing it in python was the better choice. Premature optimization is the root of all evil and the hallmark of an inexperienced engineer!
Kusalananda over 6 years

No. sed reads a line at a time into memory regardless. It will not be able to fit this line.
Kusalananda over 6 years

I can find no documentation that says anything other than that GNU sed will not do input/output buffering when using this flag. I can't see that it will read partial lines.
botkop over 3 years

this awk, not sed -> upvoting
jena over 3 years

Sorry, but why not just using tr directly to solve the problem? I mean what's wrong with tr "<unk>" "<raw_unk>"?
Gilles 'SO- stop being evil' over 3 years

@jena tr changes single characters. This command changes all u to r, all n to a, all k to w and all > to _. It's impossible to replace a specific multi-character string with tr.
jena over 3 years

Oh I see, I forgot about that, thanks.