Failed to allocate memory (NoMemoryError) in Ruby?

The problem is on these two lines:

source << tmp_src
source << tmp_src.gsub( /\s{2,}/, "\n" )

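When you read a large file, every processed chunk is appended to the source string, so by the end of the file the entire scraped content has to fit in one very large string in memory.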

The simplest solution is not to use this temporary source string at all, but to write the results directly to the file. Just replace those two lines with this instead:

# source << tmp_src
out1.write(tmp_src) 

# source << tmp_src.gsub( /\s{2,}/, "\n" )
out1.write(tmp_src.gsub( /\s{2,}/, "\n" ))                     

This way you never build a big temporary string in memory, so the script should need far less memory and run faster. Since source then stays empty, the final out1.write(source) call can be removed as well.
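
For reference, here is a minimal sketch of the per-file loop written this way. The method name scrape_stream, the two constants, and the flush_every parameter are made up for illustration and are not part of the original script; the regular expressions are copied from the question. One small difference from the original: it flushes after every 10,000th line rather than on the very first line. As in the original, a run of whitespace that straddles two chunks is not collapsed across the boundary.

```ruby
# Hypothetical helper, not from the original script.
TAG_PATTERN = /<.*?\/?>/  # tag-stripping regex from the question
BLANK_RUN   = /\s{2,}/    # whitespace-collapsing regex from the question

# Reads `input` line by line, strips tags, and writes the cleaned text to
# `output` in chunks, so memory use is bounded by roughly one chunk rather
# than by the size of the whole file.
def scrape_stream(input, output, flush_every: 10_000)
  buffer = +""  # unfrozen, reusable chunk buffer
  input.each_line.with_index(1) do |line, line_number|
    buffer << line.gsub(TAG_PATTERN, "")
    next unless (line_number % flush_every).zero?

    # Write the finished chunk immediately and reuse the buffer.
    output.write(buffer.gsub(BLANK_RUN, "\n"))
    buffer.clear
  end
  # Write whatever is left after the last full chunk.
  output.write(buffer.gsub(BLANK_RUN, "\n")) unless buffer.empty?
end
```

In the script from the question, the body of File.open(file_name, "r") do |file| would then shrink to a single call, scrape_stream(file, out1).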

Author: Sarp Kaya

Updated on August 22, 2022

Comments

  • Sarp Kaya, over 1 year ago

    I wrote a simple script that is supposed to read every file in a directory, strip the HTML tags to leave plain text, and then write the result into one file.

    I have 8 GB of memory and plenty of available virtual memory. When I run the script I have more than 5 GB of RAM free. The largest file in the directory is 3.8 GB.

    The script is:

    file_count = 1
    File.open("allscraped.txt", 'w') do |out1|
        for file_name in Dir["allParts/*.dat"] do
            puts "#{file_name}#:#{file_count}"
            file_count +=1
            File.open(file_name, "r") do |file|
                source = ""
                tmp_src = ""
                counter = 0
                file.each_line do |line|
                    scraped_content = line.gsub(/<.*?\/?>/, '')
                    tmp_src << scraped_content
                    if (counter % 10000) == 0
                        tmp_src = tmp_src.gsub( /\s{2,}/, "\n" )
                        source << tmp_src
                        tmp_src = ""
                        counter = 0
                    end
                    counter += 1
                end
                source << tmp_src.gsub( /\s{2,}/, "\n" )
                out1.write(source)
                break
            end
        end
    end
    

    The full error message is:

    realscraper.rb:33:in `block (4 levels) in <main>': failed to allocate memory (NoMemoryError)
            from realscraper.rb:27:in `each_line'
            from realscraper.rb:27:in `block (3 levels) in <main>'
            from realscraper.rb:23:in `open'
            from realscraper.rb:23:in `block (2 levels) in <main>'
            from realscraper.rb:13:in `each'
            from realscraper.rb:13:in `block in <main>'
            from realscraper.rb:12:in `open'
            from realscraper.rb:12:in `<main>'
    

    Line 27 is file.each_line do |line| and line 33 is source << tmp_src. The failing file is the largest one (3.8 GB). What is the problem here? Why am I getting this error even though I have enough memory, and how can I fix it?