How to handle large dataset with JPA (or at least with Hibernate)?
Solution 1
Suppose that we have 2 tables in the DB, Worker and WorkLog, with about 1000 rows in the first one and 10,000,000 rows in the second one.
For high volumes like this, my recommendation would be to use the StatelessSession interface from Hibernate:
Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}
tx.commit();
session.close();
In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics from the save(), saveOrUpdate() and delete() operations defined by the Session interface.
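When the application is wired through JPA, the Hibernate SessionFactory can still be reached by unwrapping the EntityManagerFactory. The following is a minimal sketch under that assumption; the WorkLog entity name is taken from the question, and the query is illustrative:

```java
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

import javax.persistence.EntityManagerFactory;

public class WorkLogScroller {

    public static void scrollWorkLogs(EntityManagerFactory emf) {
        // Unwrap the Hibernate SessionFactory from the JPA factory (JPA 2.1+).
        SessionFactory sessionFactory = emf.unwrap(SessionFactory.class);

        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            // FORWARD_ONLY lets the driver stream rows instead of
            // materializing the whole result set on the client.
            ScrollableResults results = session
                    .createQuery("select w from WorkLog w")
                    .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object workLog = results.get(0);
                // process one detached row at a time; nothing accumulates,
                // because a stateless session has no persistence context
            }
            results.close();
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}
```

Since nothing is cached, each row is garbage-collectable as soon as the loop body is done with it, which is what keeps memory flat over 10M rows.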
Solution 2
It seems you can do this with EclipseLink too. Check this: http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination
Query query = em.createQuery(...);
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true);
ScrollableCursor scrl = (ScrollableCursor) query.getSingleResult();
Object o = null;
while ((o = scrl.next()) != null) { ... }
Solution 3
There are several techniques that may need to be used in conjunction with one another to create and manipulate queries for large data-sets where memory is a limitation:
- Use setFetchSize(some value, maybe 100+), as the default (via JDBC) is 10. This is more about performance and is the single biggest related factor. It can be done in JPA via a query hint available from the provider (Hibernate, etc.); there does not (for whatever reason) seem to be a JPA Query.setFetchSize(int) method.
- Do not try to marshal the entire result set for 10K+ records. Several strategies apply:
  - For GUIs, use paging or a framework that does paging. Consider Lucene or commercial searching/indexing engines (Endeca if the company has the money).
  - For sending data somewhere, stream it and flush the buffer every N records to limit how much memory is used. The stream may be flushed to a file, network, etc.
  - Remember that underneath, JPA uses JDBC and JDBC keeps the result set on the server, only fetching N rows in a row-set group at a time. This break-down can be manipulated to facilitate flushing data in groups.
- Consider what the use-case is. Typically, an application is trying to answer questions. When the answer is to weed through 10K+ rows, the design should be reviewed. Again, consider using indexing engines like Lucene, refine the queries, consider using BloomFilters as contains-check caches to find needles in haystacks without going to the database, etc.
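The first two points can be combined in a paging loop: fetch size goes in as a provider hint (the "org.hibernate.fetchSize" hint is Hibernate-specific; EclipseLink uses "eclipselink.jdbc.fetch-size"), and setFirstResult/setMaxResults bound how many entities are in memory at once. A sketch, with the WorkLog entity name assumed from the question:

```java
import javax.persistence.EntityManager;
import javax.persistence.Query;
import java.util.List;

public class PagedReader {

    private static final int PAGE_SIZE = 100;

    // Processes WorkLog rows page by page, so at most PAGE_SIZE
    // entities are managed at any moment.
    public static void processAll(EntityManager em) {
        int first = 0;
        while (true) {
            Query q = em.createQuery("select w from WorkLog w order by w.id");
            // JDBC fetch size via a Hibernate-specific hint; there is no
            // portable Query.setFetchSize(int) in JPA.
            q.setHint("org.hibernate.fetchSize", PAGE_SIZE);
            List<?> page = q.setFirstResult(first)
                            .setMaxResults(PAGE_SIZE)
                            .getResultList();
            if (page.isEmpty()) {
                break;
            }
            for (Object workLog : page) {
                // handle one row; stream/flush instead of collecting
            }
            em.clear(); // detach the processed page to free memory
            first += PAGE_SIZE;
        }
    }
}
```

The `order by` clause matters: without a stable ordering, offset-based pages can skip or repeat rows between queries.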
Solution 4
Raw SQL shouldn't be considered a last resort. It should still be considered an option if you want to keep things "standard" on the JPA tier, but not on the database tier. JPA also has support for native queries where it will still do the mapping to standard entities for you.
However, if you have a large result set that cannot be processed in the database, then you really should just use plain JDBC as JPA (standard) does not support streaming of large sets of data.
It will be harder to port your application across different application servers if you use JPA-implementation-specific constructs, since the JPA engine is embedded in the application server and you may not have control over which JPA provider is being used.
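A minimal sketch of the plain-JDBC route, streaming rows instead of collecting them. The table and column names follow the question; the fetch size of 100 matches the suggestion elsewhere in this thread (note that some drivers, e.g. MySQL's, have their own quirks for enabling streaming):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcStreaming {

    // Streams the WorkLog table without materializing 10M rows in memory.
    public static void streamWorkLog(Connection conn) throws SQLException {
        String sql = "select workerId, hoursWorked from WorkLog";
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Ask the driver to fetch rows in batches of 100 rather
            // than pulling the entire result set to the client.
            ps.setFetchSize(100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long workerId = rs.getLong("workerId");
                    long hours = rs.getLong("hoursWorked");
                    // aggregate or flush every N rows instead of collecting
                }
            }
        }
    }
}
```

Memory usage stays bounded by the fetch size, not by the table size, which is exactly what JPA (standard) cannot guarantee for large result sets.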
Roman
Updated on October 21, 2020
Comments
-
Roman over 3 years
I need to make my web app work with really huge datasets. At the moment I get either an OutOfMemoryException or output that takes 1-2 minutes to generate.
Let's put it simply and suppose that we have 2 tables in the DB: Worker and WorkLog, with about 1000 rows in the first one and 10,000,000 rows in the second one. The latter table has several fields, including 'workerId' and 'hoursWorked', among others. What we need is:
- the total hours worked by each user;
- a list of work periods for each user.
The most straightforward approach (IMO) for each task in plain SQL is:
1)

select Worker.name, sum(hoursWorked)
from Worker, WorkLog
where Worker.id = WorkLog.workerId
group by Worker.name;
-- results of this query should be transformed to Multimap<Worker, Long>

2)

select Worker.name, WorkLog.start, WorkLog.hoursWorked
from Worker, WorkLog
where Worker.id = WorkLog.workerId;
-- results of this query should be transformed to Multimap<Worker, Period>
-- if it was JDBC, it would be vital to set resultSet.setFetchSize(someSmallNumber), ~100
So, I have two questions:
- how to implement each of my approaches with JPA (or at least with Hibernate);
- how would you handle this problem (with JPA or Hibernate of course)?
-
Roman about 14 years
@Pascal Thivent: thanks for the answer! About volumes: I don't know the real volumes, I just specified the maximum (in my opinion, which is based on some knowledge of the domain). Maybe the real volume is 10-100 times less, and IMHO the solution for these volumes will be ok too.
-
nalply over 11 years
Optimized? Can you explain this?
-
Guido over 11 years
Do you know what they exactly mean by "stateless sessions are vulnerable to data aliasing effects"? Thanks.
-
Shiladittya Chakraborty over 8 years
The setHint method is showing as undefined.
-
phil294 about 7 years
This is in no way faster. In fact, it is extremely slow and way less performant than the usual use of EntityManager.
-
phil294 about 7 years
This. I found that executing a connection query manually (session.doWork or such) is in fact the fastest you can get.
-
Archimedes Trajano about 7 years
The standard EntityManager has no doWork operation.
-
phil294 about 7 years
Yes, that's why I wrote session, which you can get via entityManager.unwrap(Session.class);. Idk if that is bad programming style though. I guess one could also write a sessionFactory bean.
-
Archimedes Trajano about 7 years
That would likely be Hibernate-specific. There's no Session class in JPA. For the benefit of others, can you update your comment to state the full package of the Session class?
-
phil294 about 7 years
I cannot update the comment anymore, so: the IDE says org.hibernate.Session.
-
Darrell Teague almost 3 years
JDBC did a lot of things right, but... respectfully, it is not good for large result sets. While the client-side tweaking of setFetchSize(int) is at the core of making this functional in terms of controlling client-side memory, it does not inherently solve the problem. That is, when one is looking for a (few) records in hundreds of thousands or millions, full-text search engines provide superior performance and use-case support.
-
Darrell Teague almost 3 years
Depending on the resulting list size, this may crash and burn with a large data set. I have found that the general pattern of storing the data of record in a database, then using a search engine (Lucene et al.) on top of that, is superior. It provides full-text search pattern support, superior performance, and built-in paging without overburdening client memory requirements. In short, almost never just return (some) Collection in response to (arbitrary) user queries, in that it may be thousands (or millions?) of records in size.
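The entityManager.unwrap(Session.class) plus doWork approach discussed in the comments above can be sketched as follows. This is Hibernate-specific; the SQL and names are assumptions based on the question's schema:

```java
import org.hibernate.Session;

import javax.persistence.EntityManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DoWorkExample {

    // Runs raw JDBC on the same connection the EntityManager manages.
    public static void sumHoursPerWorker(EntityManager em) {
        Session session = em.unwrap(Session.class); // Hibernate-specific
        session.doWork(connection -> {
            try (PreparedStatement ps = connection.prepareStatement(
                    "select workerId, sum(hoursWorked) "
                            + "from WorkLog group by workerId")) {
                ps.setFetchSize(100); // stream in small batches
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one aggregated row at a time
                    }
                }
            }
        });
    }
}
```

Because the aggregation happens in the database, only ~1000 summary rows (one per worker) ever cross the wire, sidestepping the 10M-row problem entirely.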