How to handle large datasets with JPA (or at least with Hibernate)?


Solution 1

Suppose we have two tables in the DB: Worker and WorkLog, with about 1,000 rows in the first and 10,000,000 rows in the second.

For high volumes like this, my recommendation would be to use the StatelessSession interface from Hibernate:

Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

// FORWARD_ONLY scrolling streams rows from the JDBC result set
// instead of materializing the whole result in memory
ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    // executes an immediate SQL UPDATE; nothing is cached or queued
    session.update(customer);
}
customers.close();

tx.commit();
session.close();

In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics to the save(), saveOrUpdate() and delete() operations defined by the Session interface.
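
Applied to the Worker/WorkLog example from the question, a minimal sketch of the hours-per-worker aggregation might look like the following. It assumes Hibernate 5-style APIs and a mapped Long hoursWorked property; the query string, fetch size, and variable names are illustrative, and when the database can do the aggregation for you, the question's own GROUP BY query is still preferable:

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

// stream the join row by row; FORWARD_ONLY keeps memory usage flat
ScrollableResults rows = session.createQuery(
        "select w.name, log.hoursWorked from Worker w, WorkLog log "
      + "where w.id = log.workerId")
    .setFetchSize(100)                      // rows fetched per JDBC round trip
    .scroll(ScrollMode.FORWARD_ONLY);

Map<String, Long> totalHours = new HashMap<>();
while (rows.next()) {
    String name = (String) rows.get(0);
    Long hours = (Long) rows.get(1);
    totalHours.merge(name, hours, Long::sum);   // accumulate per worker
}
rows.close();

tx.commit();
session.close();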

Solution 2

It seems you can do this with EclipseLink too. See http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination :

// the JPQL string is illustrative; any query works here
Query query = em.createQuery("select w from Worker w");
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true);
ScrollableCursor scrl = (ScrollableCursor) query.getSingleResult();
Object o = null;
while (scrl.hasNext()) {
    o = scrl.next();
    // ... process o ...
}
scrl.close();   // release the underlying JDBC cursor
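
If memory use is still an issue, EclipseLink also exposes a hint for the JDBC fetch size, which controls how many rows each round trip to the database brings back; a small sketch, with an illustrative value:

query.setHint(QueryHints.JDBC_FETCH_SIZE, 100);   // rows per JDBC round trip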

Solution 3

There are several techniques that may need to be used together to create and run queries over large data sets where memory is the limitation:

  1. Use setFetchSize(some value, maybe 100+), because the default fetch size in some JDBC drivers (e.g. Oracle's) is only 10 rows. This is more about performance, and it is the single biggest factor there. It can be done in JPA through a provider-specific query hint (e.g. Hibernate's org.hibernate.fetchSize); for whatever reason, there is no standard JPA Query.setFetchSize(int) method.
  2. Do not try to marshal the entire result set for 10K+ records. Several strategies apply: for GUIs, use paging or a framework that does paging. Consider Lucene or a commercial search/indexing engine (Endeca, if the company has the money). When sending data somewhere, stream it and flush the buffer every N records to limit how much memory is used; the stream may be flushed to a file, the network, etc. Remember that underneath, JPA uses JDBC, and JDBC keeps the result set on the server, fetching only N rows at a time in row-set groups. This breakdown can be exploited to flush data in groups, as the sketch after this list shows.
  3. Consider what the use case is. Typically, an application is trying to answer questions. When the answer is to weed through 10K+ rows, the design should be reviewed. Again, consider indexing engines like Lucene, refine the queries, and consider using Bloom filters as contains-check caches to find needles in haystacks without going to the database.
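
A minimal JDBC sketch of points 1 and 2 combined, assuming the Worker/WorkLog schema from the question and an open java.sql.Connection named connection; the fetch size, flush interval, and output file are illustrative:

String sql = "select w.name, l.start, l.hoursWorked "
           + "from Worker w join WorkLog l on w.id = l.workerId";
try (PreparedStatement ps = connection.prepareStatement(sql);
     Writer out = new BufferedWriter(new FileWriter("worklog.csv"))) {
    ps.setFetchSize(100);                        // rows per JDBC round trip
    try (ResultSet rs = ps.executeQuery()) {
        long n = 0;
        while (rs.next()) {
            out.write(rs.getString(1) + "," + rs.getTimestamp(2)
                    + "," + rs.getLong(3) + "\n");
            if (++n % 1000 == 0) out.flush();    // bound memory every N records
        }
    }
}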

Solution 4

Raw SQL shouldn't be considered a last resort. It should still be considered an option if you want to keep things "standard" on the JPA tier, but not on the database tier. JPA also supports native queries, where it will still map the results to standard entities for you.
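
For example, a minimal sketch of a native query that still maps rows to an entity; the SQL follows the question's schema, and the 100-hour threshold is an illustrative assumption:

List<Worker> busyWorkers = em.createNativeQuery(
        "select * from Worker where id in "
      + "(select workerId from WorkLog "
      + " group by workerId having sum(hoursWorked) > 100)",
        Worker.class)                 // map each row to the Worker entity
    .getResultList();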

However, if you have a large result set that cannot be processed in the database, then you really should just use plain JDBC, as standard JPA does not support streaming large sets of data.

It will also be harder to port your application across application servers if you use JPA-implementation-specific constructs, since the JPA engine is embedded in the application server and you may not have control over which JPA provider is being used.

Author: Roman

Updated on October 21, 2020

Comments

  • Roman
    Roman over 3 years

    I need to make my web app work with really huge datasets. At the moment I get either an OutOfMemoryException or output that takes 1-2 minutes to generate.

    Let's keep it simple and suppose we have two tables in the DB: Worker and WorkLog, with about 1,000 rows in the first and 10,000,000 rows in the second. The latter table has several fields, including 'workerId' and 'hoursWorked'. What we need is:

    1. count total hours worked by each user;

    2. list of work periods for each user.

    The most straightforward approach (IMO) for each task in plain SQL is:

    1)

    select Worker.name, sum(hoursWorked) from Worker, WorkLog 
       where Worker.id = WorkLog.workerId 
       group by Worker.name;
    
    //results of this query should be transformed to Multimap<Worker, Long>
    

    2)

    select Worker.name, WorkLog.start, WorkLog.hoursWorked from Worker, WorkLog
       where Worker.id = WorkLog.workerId;
    
    //results of this query should be transformed to Multimap<Worker, Period>
    //if this were JDBC, it would be vital
    //to call resultSet.setFetchSize(someSmallNumber), e.g. ~100
    

    So, I have two questions:

    1. how to implement each of my approaches with JPA (or at least with Hibernate);
    2. how would you handle this problem (with JPA or Hibernate of course)?
  • Roman
    Roman about 14 years
    @Pascal Thivent: thanks for the answer! About volumes: I don't know the real volumes; I just specified the maximum (in my opinion, based on some knowledge of the domain). Maybe the real volume is 10-100 times less, and IMHO a solution for those volumes will be OK too.
  • nalply
    nalply over 11 years
    Optimized? Can you explain this?
  • Guido
    Guido over 11 years
    Do you know what they exactly mean by "stateless sessions are vulnerable to data aliasing effects"? Thanks.
  • Shiladittya Chakraborty
    Shiladittya Chakraborty over 8 years
    The setHint method is showing as undefined.
  • phil294
    phil294 about 7 years
    This is in no way faster. In fact, it is extremely slow and way less performant than the usual use of EntityManager.
  • phil294
    phil294 about 7 years
    This. I found that executing a query on the connection manually (session.doWork or similar) is in fact the fastest you can get.
  • Archimedes Trajano
    Archimedes Trajano about 7 years
    The standard EntityManager has no doWork operation.
  • phil294
    phil294 about 7 years
    Yes, that's why I wrote session, which you can get via entityManager.unwrap(Session.class). I don't know if that is bad programming style, though. I guess one could also write a SessionFactory bean.
  • Archimedes Trajano
    Archimedes Trajano about 7 years
    That would likely be Hibernate-specific. There's no Session class in standard JPA. For the benefit of others, can you update your comment to state the full package of the Session class?
  • phil294
    phil294 about 7 years
    I cannot update the comment anymore, so: IDEA says it is org.hibernate.Session. (See the sketch after these comments.)
  • Darrell Teague
    Darrell Teague almost 3 years
    JDBC did a lot of things right, but, respectfully, it is not good for large result sets. While client-side tweaking via setFetchSize(int) is at the core of controlling client-side memory, it does not inherently solve the problem. That is, when one is looking for a few records among hundreds of thousands or millions, full-text search engines provide superior performance and use-case support.
  • Darrell Teague
    Darrell Teague almost 3 years
    Depending on the resulting list size, this may crash and burn with a large data set. I have found that the general pattern of storing the RI data in a database, then putting a search engine (Lucene et al.) on top of it, is superior. It provides full-text search support, superior performance, and built-in paging without overburdening client memory. In short, almost never just return (some) Collection in response to (arbitrary) user queries, as it may be thousands (or millions?) of records in size.
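
For reference, a minimal sketch of the unwrapping discussed in the comments above, assuming Hibernate as the JPA provider (the query is illustrative):

org.hibernate.Session session = entityManager.unwrap(org.hibernate.Session.class);
session.doWork(connection -> {
    // plain JDBC against the same underlying connection
    try (PreparedStatement ps = connection.prepareStatement(
            "select count(*) from WorkLog")) {
        // ...
    }
});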