How do I load 100 million rows into memory?


Solution 1

A hundred million records means that each record may take up at most about 50 bytes in order to fit within 6 GB, leaving some headroom for other allocations. In Java 50 bytes is nothing; a mere Object[] takes 32 bytes per element. You must find a way to use the results immediately inside your while (rs.next()) loop and not retain them in full.
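For example, here is a minimal sketch of that pattern, assuming an open java.sql.Connection named conn and a hypothetical processRating method that consumes one row:

Statement s = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings");
while (rs.next()) {
    // consume the row right here; nothing is retained after this iteration
    processRating(rs.getInt("movie_id"),    // processRating is hypothetical
                  rs.getInt("customer_id"),
                  rs.getInt("rating"));
}
rs.close();
s.close();

Combine this with the fetch-size settings from Solution 3 below; otherwise the MySQL driver will still try to buffer the entire result set during executeQuery.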

Solution 2

The problem is I get the java.lang.OutOfMemoryError on the s.executeQuery(...) line itself

You can split your query into multiple smaller ones:

    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 0,300"); //shows the first 300 results
    //process this first result
    s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT 300,600");//shows 300 results starting from the 300th one
    //process this second result
    //etc

You can wrap this in a loop that stops when a query returns no more results, as in the sketch below.
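For instance, a minimal sketch of such a loop, assuming the same Statement s and that each batch is processed as it is read:

int batchSize = 300;
int offset = 0;
boolean more = true;
while (more) {
    ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings"
            + " LIMIT " + offset + "," + batchSize);
    more = false;
    while (rs.next()) {
        more = true;
        // process the current row here instead of storing it
    }
    rs.close();
    offset += batchSize;
}

Keep in mind that MySQL's LIMIT takes an offset and a row count, and large offsets get progressively slower because the server still scans past all the skipped rows; this is exactly the slowdown described in the question's update below.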

Solution 3

You can call stmt.setFetchSize(50); and conn.setAutoCommit(false); to avoid reading the entire ResultSet into memory.

Here's what the docs say (this passage is from the PostgreSQL JDBC driver documentation, but the same cursor-based fetching idea applies to MySQL via useCursorFetch, as the example below shows):

Getting results based on a cursor

By default the driver collects all the results for the query at once. This can be inconvenient for large data sets so the JDBC driver provides a means of basing a ResultSet on a database cursor and only fetching a small number of rows.

A small number of rows are cached on the client side of the connection and when exhausted the next block of rows is retrieved by repositioning the cursor.

Note:

  • Cursor based ResultSets cannot be used in all situations. There are a number of restrictions which will make the driver silently fall back to fetching the whole ResultSet at once.
  • The connection to the server must be using the V3 protocol. This is the default for (and is only supported by) server versions 7.4 and later.
  • The Connection must not be in autocommit mode. The backend closes cursors at the end of transactions, so in autocommit mode the backend will have closed the cursor before anything can be fetched from it.
  • The Statement must be created with a ResultSet type of ResultSet.TYPE_FORWARD_ONLY. This is the default, so no code will need to be rewritten to take advantage of this, but it also means that you cannot scroll backwards or otherwise jump around in the ResultSet.
  • The query given must be a single statement, not multiple statements strung together with semicolons.

Example: Setting fetch size to turn cursors on and off.

Changing code to cursor mode is as simple as setting the fetch size of the Statement to the appropriate size. Setting the fetch size back to 0 will cause all rows to be cached (the default behaviour).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

Class.forName("com.mysql.jdbc.Driver"); // optional with JDBC 4+ drivers
Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost/test?useCursorFetch=true&user=root");
// Make sure autocommit is off: the backend closes cursors at transaction end.
conn.setAutoCommit(false);
Statement st = conn.createStatement();

// Turn use of the cursor on: fetch 50 rows at a time.
st.setFetchSize(50);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    System.out.println("a row was returned.");
}
rs.close();

// Turn the cursor off: all rows are fetched and cached at once (the default).
st.setFetchSize(0);
rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    System.out.println("many rows were returned.");
}
rs.close();

// Close the statement and connection.
st.close();
conn.close();
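The example above enables server-side cursors through useCursorFetch on the connection URL. As an aside, MySQL's Connector/J also offers a row-by-row streaming mode: setting the fetch size to Integer.MIN_VALUE on a forward-only, read-only statement (this is what the question's update below uses). A minimal sketch, again assuming an open Connection conn:

// Streaming mode: the driver reads rows from the server one at a time
// instead of buffering the whole result set on the client.
Statement st = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
        ResultSet.CONCUR_READ_ONLY);
st.setFetchSize(Integer.MIN_VALUE);
ResultSet rs = st.executeQuery("SELECT * FROM mytable");
while (rs.next()) {
    // consume each row as it arrives
}
rs.close();
st.close();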

Comments

  • ravindrab, almost 4 years ago

I need to load 100 million+ rows from a MySQL database into memory. My Java program fails with java.lang.OutOfMemoryError: Java heap space. I have 8 GB of RAM on my machine and I have set -Xmx6144m in my JVM options.

    This is my code

    public List<Record> loadTrainingDataSet() {
    
        ArrayList<Record> records = new ArrayList<Record>();
        try {
            Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY, java.sql.ResultSet.CONCUR_READ_ONLY);
            s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings");
            ResultSet rs = s.getResultSet();
            int count = 0;
            while (rs.next()) {
    

    Any idea how to overcome this problem?


    UPDATE

I came across this post, and based on it and the comments below I updated my code. It seems I am able to load the data into memory with the same -Xmx6144m setting, but it takes a long time.

    Here is my code.

    ...
    import org.apache.mahout.math.SparseMatrix;
    ...
    
    @Override
    public SparseMatrix loadTrainingDataSet() {
        long t1 = System.currentTimeMillis();
        SparseMatrix ratings = new SparseMatrix(NUM_ROWS,NUM_COLS);
        int REC_START = 0;
        int REC_END = 0;
    
        try {
            for (int i = 1; i <= 101; i++) {
                long t11 = System.currentTimeMillis();
                REC_END = 1000000 * i;
                Statement s = conn.createStatement(java.sql.ResultSet.TYPE_FORWARD_ONLY,
                        java.sql.ResultSet.CONCUR_READ_ONLY);
                s.setFetchSize(Integer.MIN_VALUE);
                ResultSet rs = s.executeQuery("SELECT movie_id,customer_id,rating FROM ratings LIMIT " + REC_START + "," + REC_END);//100480507
                while (rs.next()) {
                    int movieId = rs.getInt("movie_id");
                    int customerId = rs.getInt("customer_id");
                    byte rating = (byte) rs.getInt("rating");
                    ratings.set(customerId,movieId,rating);
                }
                long t22 = System.currentTimeMillis();
                System.out.println("Round " + i + " completed " + (t22 - t11) / 1000 + " seconds");
                rs.close();
                s.close();
            }
    
        } catch (Exception e) {
            System.err.println("Cannot connect to database server " + e);
        } finally {
            if (conn != null) {
                try {
                    conn.close();
                    System.out.println("Database connection terminated");
                } catch (Exception e) { /* ignore close errors */ }
            }
        }
        long t2 = System.currentTimeMillis();
        System.out.println(" Took " + (t2 - t1) / 1000 + " seconds");
        return ratings;
    }
    

To load the first 100,000 rows took 2 seconds. To load the 29th batch of 100,000 rows took 46 seconds. I stopped the process midway since it was taking too long. Are these acceptable times? Is there a way to improve the performance of this code? I am running this on a 64-bit Windows machine with 8 GB of RAM.