Replace string in a huge (70GB), one line, text file

30,960

Solution 1

The usual text processing tools are not designed to handle lines that don't fit in RAM. They tend to work by reading one record (one line), manipulating it, and outputting the result, then proceeding to the next record (line).

If there's an ASCII character that appears frequently in the file and doesn't appear in <unk> or <raw_unk>, then you can use that as the record separator. Since most tools don't allow custom record separators, swap between that character and newlines. tr processes bytes, not lines, so it doesn't care about any record size. Supposing that ; works:

<corpus.txt tr '\n;' ';\n' |
sed 's/<unk>/<raw_unk>/g' |
tr '\n;' ';\n' >corpus.txt.new

You could also anchor on the first character of the text you're searching for, assuming that it isn't repeated in the search text and it appears frequently enough. If the file may start with unk>, change the sed command to sed '2,$ s/… to avoid a spurious match.

<corpus.txt tr '\n<' '<\n' |
sed 's/^unk>/raw_unk>/g' |
tr '\n<' '<\n' >corpus.txt.new

Alternatively, use the last character.

<corpus.txt tr '\n>' '>\n' |
sed 's/<unk$/<raw_unk/g' |
tr '\n>' '>\n' >corpus.txt.new

Note that this technique assumes that sed operates seamlessly on a file that doesn't end with a newline, i.e. that it processes the last partial line without truncating it and without appending a final newline. It works with GNU sed. If you can pick the last character of the file as the record separator, you'll avoid any portability trouble.

Solution 2

For such a big file, one possibility is Flex. Let unk.l be:

%%
\<unk\>     printf("<raw_unk>");  
%%

Then compile and execute:

$ flex -o unk.c  unk.l
$ cc -o unk -O2 unk.c -lfl
$ unk < corpus.txt > corpus.txt.new

Solution 3

So you don't have enough physical memory (RAM) to hold the whole file at once, but on a 64-bit system you have enough virtual address space to map the entire file. Virtual mappings can be useful as a simple hack in cases like this.

The necessary operations are all included in Python. There are several annoying subtleties, but it does avoid having to write C code. In particular, care is needed to avoid copying the file in memory, which would defeat the point entirely. On the plus side, you get error-reporting for free (python "exceptions") :).

#!/usr/bin/python3
# This script takes input from stdin
# (but it must be a regular file, to support mapping it),
# and writes the result to stdout.

search = b'<unk>'
replace = b'<raw_unk>'


import sys
import os
import mmap

# sys.stdout requires str, but we want to write bytes
out_bytes = sys.stdout.buffer

mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ)
i = mem.find(search)
if i < 0:
    sys.exit("Search string not found")

# mmap object subscripts to bytes (making a copy)
# memoryview object subscripts to a memoryview object
# (it implements the buffer protocol).
view = memoryview(mem)

out_bytes.write(view[:i])
out_bytes.write(replace)
out_bytes.write(view[i+len(search):])

Solution 4

There is a replace utility in the mariadb-server/mysql-server package. It replaces simple strings (not regular expressions) and unlike grep/sed/awk replace does not care about \n and \0. Memory consumption is constant with any input file (about 400kb on my machine).

Of course you do not need to run a mysql server in order to use replace, it is only packaged that way in Fedora. Other distros/operating systems may have it packaged separately.

Solution 5

I think the C version might perform much better:

#include <stdio.h>
#include <string.h>

#define PAT_LEN 5

int main()
{
    /* note this is not a general solution. In particular the pattern
     * must not have a repeated sequence at the start, so <unk> is fine
     * but aardvark is not, because it starts with "a" repeated, and ababc
     * is not because it starts with "ab" repeated. */
    char pattern[] = "<unk>";          /* set PAT_LEN to length of this */
    char replacement[] = "<raw_unk>"; 
    int c;
    int i, j;

    for (i = 0; (c = getchar()) != EOF;) {
        if (c == pattern[i]) {
            i++;
            if (i == PAT_LEN) {
                printf("%s", replacement);
                i = 0;
            }
        } else {
            if (i > 0) {
                for (j = 0; j < i; j++) {
                    putchar(pattern[j]);
                }
                i = 0;
            }
            if (c == pattern[0]) {
                i = 1;
            } else {
                putchar(c);
            }
        }
    }
    /* TODO: fix up end of file if it ends with a part of pattern */
    return 0;
}

EDIT: Modified according to suggestions from the comments. Also fixed bug with the pattern <<unk>.

Share:
30,960
Christos Baziotis
Author by

Christos Baziotis

Updated on September 18, 2022

Comments

  • Christos Baziotis
    Christos Baziotis almost 2 years

    I have a huge (70GB), one line, text file and I want to replace a string (token) in it. I want to replace the token <unk>, with another dummy token (glove issue).

    I tried sed:

    sed 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
    

    but the output file corpus.txt.new has zero-bytes!

    I also tried using perl:

    perl -pe 's/<unk>/<raw_unk>/g' < corpus.txt > corpus.txt.new
    

    but I got an out of memory error.

    For smaller files, both of the above commands work.

    How can I replace a string is such a file? This is a related question, but none of the answers worked for me.

    Edit: What about splitting the file in chunks of 10GBs (or whatever) each and applying sed on each one of them and then merging them with cat? Does that make sense? Is there a more elegant solution?

    • RomanPerekhrest
      RomanPerekhrest over 6 years
      as @Gilles noted, can you detect some repeated character that could serve as a custom delimiter in your single big line?
    • ctrl-alt-delor
      ctrl-alt-delor over 6 years
      I am thinking that a tool that can only do search and replace, but not any more complex regex, would be faster. It would also not benefit from doing a line at a time, so would not choke on this file. Unfortunately I have no idea of the existence of such a tool, though it would not be hard to write. If it is a one off then substituting in newline characters as in one of the answers would probably be easiest.
    • Patrick Bucher
      Patrick Bucher over 6 years
      Does your file contain anything other than ASCII? If so, all the unicode handling could be omitted and raw bytes could be processed.
    • Thomas Carlisle
      Thomas Carlisle over 6 years
      I agree with @PatrickButcher Look at a bigger picture. Besides the immediate need to replace this text, what else is this file supposed to be used for? If it is a log of some sort, no one is going to be able to work with it effectively. If it is a data file that some app uses, then that app should hold the responsibility for maintaining the data in that file.
    • ashu
      ashu over 6 years
      @ThomasCarlisle: GloVe is related to Word2vec.
    • Martin Bonner supports Monica
      Martin Bonner supports Monica over 6 years
      Your example replacement is s/<unk>/<raw_unk>/. If you could use rnk rather than raw_unk, you could edit the file in-place with a simple C program - which would avoid having to write out 70GB (of course, you would still need to read 70GB, and if the <unk> occurs frequently, you end up rewriting most of the file anyway.
    • LSpice
      LSpice over 6 years
      In case you have easy access to a Mac system, Hex Fiend is designed for editing huge files.
    • Vladislavs Dovgalecs
      Vladislavs Dovgalecs over 6 years
      You can use split with -b option defining chunk file sizes in bytes. Process each in turn using sed and the re-assemble. There is a risk is that <unk> can be split in two files and won't be found...
    • LSpice
      LSpice over 6 years
      @VladislavsDovgalecs, good idea! For the problem you mention, maybe a sliding window will address it (for example, checking files split on 0-, 1-, 2-, 3-, and 4-aligned boundaries quintuples the work, but prevents a possible miss); or else, more economically, just giving special processing to any chunk (and its successor) that ends with any initial substring of <unk>?
    • Kusalananda
      Kusalananda over 6 years
      Is this an XML file? In that case, there are XML parsers that can do manipulations on XML streams...
  • cat
    cat over 6 years
    does this avoid reading the entire file into memory?
  • Patrick Bucher
    Patrick Bucher over 6 years
    It reads the file character by character and never holds the entire file in the memory, just individual characters.
  • Patrick Bucher
    Patrick Bucher over 6 years
    scanner.Split(bufio.ScanRunes) does the magic.
  • Patrick Bucher
    Patrick Bucher over 6 years
    Also check go doc bufio.MaxScanTokenSize for the default buffer size.
  • Patrick Bucher
    Patrick Bucher over 6 years
    I measured 0.3 seconds for 30 megabytes, so it could be done within 12 minutes.
  • MiniMax
    MiniMax over 6 years
    I also was thinking about this method, but using the while read -N 1000 chunk; (the 1000 picked as an example). The solution for the <unk>, broken between the chunks, is two passes through the file: the first with the 100MB chunks and the second with the '100MB + 5 byte' chunks. But it is not optimal solution in the case of the 70GB file.
  • RiaD
    RiaD over 6 years
    you may print (pattern[j]) instead of (buf[j]) (they are equal at this point, so you don't need buffer
  • RiaD
    RiaD over 6 years
    also code will not work for string "<<unk>" ideone.com/ncM2yy
  • jamesqf
    jamesqf over 6 years
    If you want to do it efficiently, doing binary read/writes and asynchronous I/O would probably help.
  • Peter Cordes
    Peter Cordes over 6 years
    30 MB in 0.3 seconds? That's only 90 MB / second. memcpy speed (i.e. the memory bottleneck) is something like 12GB / second on a recent x86 CPU (e.g. Skylake). Even with stdio + system call overhead, for a 30MB file hot in disk cache, I'd expect maybe 1GB / second for an efficient implementation. Did you compile with optimization disabled, or is one-char-at-a-time I/O really that slow? getchar_unlocked / putchar_unlocked might help, but definitely better to read/write in chunks of maybe 128kiB (half of L2 cache size on most x86 CPUs, so you mostly hit in L2 while looping after read)
  • Wildcard
    Wildcard over 6 years
    I don't have such a file to test with, but in Awk you can specify the "Record Separator" and the "Output Record Separator". So assuming you have a decent smattering of commas in your file, it's possible you could solve this with: awk -v RS=, -v ORS=, '{gsub(/<unk>/, "<raw_unk>"); print}' No?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 6 years
    @Wildcard Yes, that's another solution. Awk tends to be slower than sed though, that's why I don't offer it as the preferred solution for a huge file.
  • jthill
    jthill over 6 years
    make has default rules for this, instead of the flex/cc you can add an %option main as the first line of unk.l and then just make unk. I more-or-less reflexively use %option main 8bit fast, and have export CFLAGS='-march=native -pipe -Os' in my .bashrc.
  • Rui F Ribeiro
    Rui F Ribeiro over 6 years
    from top of my head, getchar and putchar is slow.
  • icarus
    icarus over 6 years
    The fix to the program for "<<unk>" still doesn't work if the pattern starts with a repeated sequence of characters (i.e. it wouldn't work if you were trying to replace aardvark with zebra and you had input of aaardvak, or you were trying to replace ababc and had input of abababc). In general you can not move forward by the number of characters you have read unless you know that there is no possibility of a match starting in the characters you have read.
  • icarus
    icarus over 6 years
    Like your C program, this doesn't work for replacing aardvark with zebra with an input of aaardvark.
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    @MiniMax, that second pass would not necessarily help as the first pass would have added 5 bytes for each occurrence of <unk>.
  • Stéphane Chazelas
    Stéphane Chazelas over 6 years
    @roaima, yes that would be a much more involved solution. Here it's a simple approach which is only highly probable (assuming the <unk> occurrences are far appart, if not, use $/ = ">" and s/<unk>\z/<raw_unk>/g) of being correct.
  • jamesqf
    jamesqf over 6 years
    @undercat: If it weren't off-topic, I could show you a number of non-compiler front end applications, from solving the water-level problem to special-purpose input parsing. It's amazing what you can do with it, if you think outside the box a bit :-)
  • Rahul
    Rahul over 6 years
    If My system has about 4 gb consequite memory free out of the 8 gb, does mem = mmap.mmap(sys.stdin.fileno(), 0, access=mmap.ACCESS_READ) mean that it place the data in that space? Or would it be much lower (1gb?)>
  • user2948306
    user2948306 over 6 years
    @Rahul "So you don't have enough RAM, but on a 64-bit system you have enough virtual address space to map the entire file." It's paged in and out of physical ram on demand (or lack thereof). This program should work without requiring any large amount of physical RAM. 64-bit systems have much more virtual address space than the maximum physical ram. Also each running process has it's own virtual address space. This means the system as a whole running out of virtual address space isn't a thing, it's not a valid concept.
  • Patrick Bucher
    Patrick Bucher over 6 years
    @icarus: Sure, the program only works for the pattern <unk>. It's not a general solution to a substitution problem, it's a solution for the very specific problem at hand.
  • Patrick Bucher
    Patrick Bucher over 6 years
    @icarus: As stated in the comment to the C program, the program only deals with the one specific substitution of <unk>. Other patterns require a different logic.
  • Rahul
    Rahul over 6 years
    Thank you. I was trying to understand from this example how the os will decide regarding memory allocation when we use this python method. I understand now, its just same a typical executable. The same range of memory, hence no limitation or optimizations. I am not familiar with memory mapping in python. Thank you for the exaplanation.
  • user2948306
    user2948306 over 6 years
    @Rahul yep! python mmap.mmap() is a fairly thin wrapper around the C function mmap(). And mmap() is the same mechanism used to run executables, and code from shared libraries.
  • beasy
    beasy over 6 years
    You can set the record separator in Perl with command line option -0 and the octal value of a char, or inside the script it can be set with special variable $/
  • jamesqf
    jamesqf over 6 years
    But why would one want to avoid writing C? For this, it's no more difficult (assuming a knowedge of both languages) than Python, and perhaps a bit more compact.
  • Rahul
    Rahul over 6 years
    @jamesqf I could be wrong, but I feel it is just a personal choice. Since the performance losses would be negligible (because as he said, the function actual does call the c function), the overhead wastage is very low, since no other stuff is happening in between. C would have been better, but this solution was not aiming for optimization, just to solve the bigger and difficult 70gb issue.
  • user2948306
    user2948306 over 6 years
    In general, writing in python is more compact. In this case it turned out there's a couple of details in the python version, and the C version might have been nicer to write. (Though it's not so simple if search can contain a NUL character. And I notice the other C version here does not support NUL characters in replace.). You're very welcome to derive the C version for comparison purposes. However remember that my version includes basic error reporting for the operations it performs. The C version would at least be more annoying to read IMO, when error reporting is included.
  • jamesqf
    jamesqf over 6 years
    @Rahul: Sure, it's personal choice, but I don't think "to avoid writing in C" is a good basis for that choice. Especially if one happens to be fluent in C. Indeed, I'd probably use C to avoid writing in Python :-)
  • Rahul
    Rahul over 6 years
    @jamesqf I respect your opinion, and also that of op. When I answered your comment, the context was my comment - regarding memory space and how a c function was wrapped. Hence my response. When I read this response of yours, I did read the answer again, and understood the context you quote. I had not responded to that. So, I withdraw. I think the conversation was between you two, and I meddled, sorry for that to both of you. And also, I dont know deep c or python, I am studying :)
  • user285259
    user285259 over 6 years
    @Gilles : But using awk avoid passing the stream twice to tr. So would it be still slower ?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 6 years
    @user285259 Typically not. tr is very fast and the pipe can even be parallelized.
  • rackandboneman
    rackandboneman over 6 years
    How do you make sure you are not breaking the pattern in edge cases where there isn't enough whitespace available?
  • alfreema
    alfreema over 6 years
    As stated, for this to be robust there's a requirement that there is at least one space every X characters. You can do that analysis easy enough, with any blocksize you choose: fold -w X mailtest.txt | grep -v " " | wc -l The number it returns is the number of folded lines with potential edge cases. If it's zero, the solution is guaranteed to work.
  • yieldsfalsehood
    yieldsfalsehood over 6 years
    What if <unk> falls on a boundary between chunks?
  • JJoao
    JJoao over 6 years
    @jthill, thank you: %option main + make + optionally CFLAGS are a very nice trick!! Is -march=native the default behaviour?
  • JJoao
    JJoao over 6 years
    @jamesqf, I feel the same and I am curious about your point of view and message: why not writing a new question to specifically raise that topic?
  • jamesqf
    jamesqf over 6 years
    @JJoao: I'm not sure how such a question could be worded. "What interesting non compiler frontend things can you do with flex?" perhaps.
  • user1730706
    user1730706 over 6 years
    @jamesqf as you said - will be hard to make that an on topic question - but I would like to see it also
  • user2948306
    user2948306 over 6 years
    yes, yes it is. "Don't use Hadoop - your data isn't that big". This is a very simple streaming IO problem.
  • lord AJ
    lord AJ over 6 years
    @jamesqf A prof of mine at uni used flex to build a tool that recognised fabric types for a factory! How about asking something like: "flex seems like a very powerful tool but I'm unlikely to be writing any compilers/parsers - are there any other use cases for flex?"
  • MoonCheese62
    MoonCheese62 over 6 years
    @jamesqf: I'm unaware of a standard libc function that searches arbitrary byte sequences in raw memory, i. e. disregarding any NUL termination characters. Of course we can roll our own, but that's going to be at least one of slow (with a naive algorithm) or error prone (for an efficient algorithm). Python’s find function on buffer-like objects is supposedly well tested and implements a sensible search algorithm like Boyer-Moore.
  • Trevor Boyd Smith
    Trevor Boyd Smith over 6 years
    i've never used tr before and only a little sed. if the gawk solution is a little more easy to read/understand, maybe post the gawk solution in the comments.
  • jazzdelightsme
    jazzdelightsme over 6 years
    All these bugs is why writing it in python was the better choice. Premature optimization is the root of all evil and the hallmark of an inexperienced engineer!
  • Kusalananda
    Kusalananda over 6 years
    No. sed reads a line at a time into memory regardless. It will not be able to fit this line.
  • Kusalananda
    Kusalananda over 6 years
    I can find no documentation that says anything other than that GNU sed will not do input/output buffering when using this flag. I can't see that it will read partial lines.
  • botkop
    botkop over 3 years
    this awk, not sed -> upvoting
  • jena
    jena over 3 years
    Sorry, but why not just using tr directly to solve the problem? I mean what's wrong with tr "<unk>" "<raw_unk>"?
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 3 years
    @jena tr changes single characters. This command changes all u to r, all n to a, all k to w and all > to _. It's impossible to replace a specific multi-character string with tr.
  • jena
    jena over 3 years
    Oh I see, I forgot about that, thanks.