How to handle large dataset with JPA (or at least with Hibernate)?
Solution 1
Suppose that we have 2 tables in the DB, Worker and WorkLog, with about 1000 rows in the first one and 10,000,000 rows in the second one.
For high volumes like this, my recommendation would be to use the StatelessSession interface from Hibernate:
Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();
ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}
tx.commit();
session.close();
In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics from the save(), saveOrUpdate() and delete() operations defined by the Session interface.
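When the application is wired through JPA, the Hibernate SessionFactory can still be reached by unwrapping the EntityManagerFactory. The following is a minimal sketch under that assumption; the WorkLog entity name is taken from the question, and the query is illustrative:

```java
import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;
import org.hibernate.Transaction;

import javax.persistence.EntityManagerFactory;

public class WorkLogScroller {

    public static void scrollWorkLogs(EntityManagerFactory emf) {
        // Unwrap the Hibernate SessionFactory from the JPA factory (JPA 2.1+).
        SessionFactory sessionFactory = emf.unwrap(SessionFactory.class);

        StatelessSession session = sessionFactory.openStatelessSession();
        Transaction tx = session.beginTransaction();
        try {
            // FORWARD_ONLY lets the driver stream rows instead of
            // materializing the whole result set on the client.
            ScrollableResults results = session
                    .createQuery("select w from WorkLog w")
                    .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object workLog = results.get(0);
                // process one detached row at a time; nothing accumulates,
                // because a stateless session has no persistence context
            }
            results.close();
            tx.commit();
        } catch (RuntimeException e) {
            tx.rollback();
            throw e;
        } finally {
            session.close();
        }
    }
}
```

Since nothing is cached, each row is garbage-collectable as soon as the loop body is done with it, which is what keeps memory flat over 10M rows.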
Solution 2
It seems you can do this with EclipseLink too. Check this: http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination
Query query = em.createQuery(...);
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true);
ScrollableCursor scrl = (ScrollableCursor) query.getSingleResult();
Object o = null;
while ((o = scrl.next()) != null) { ... }
Solution 3
There are several techniques that may need to be used in conjunction with one another to create and manipulate queries for large data-sets where memory is a limitation:
- Use setFetchSize(some value, maybe 100+), as the default (via JDBC) is 10. This is more about performance and is the single biggest related factor. It can be done in JPA via a query hint available from the provider (Hibernate, etc.); there does not (for whatever reason) seem to be a JPA Query.setFetchSize(int) method.
- Do not try to marshal the entire result set for 10K+ records. Several strategies apply:
  - For GUIs, use paging or a framework that does paging. Consider Lucene or commercial searching/indexing engines (Endeca if the company has the money).
  - For sending data somewhere, stream it and flush the buffer every N records to limit how much memory is used. The stream may be flushed to a file, network, etc.
  - Remember that underneath, JPA uses JDBC and JDBC keeps the result set on the server, only fetching N rows in a row-set group at a time. This break-down can be manipulated to facilitate flushing data in groups.
- Consider what the use-case is. Typically, an application is trying to answer questions. When the answer is to weed through 10K+ rows, the design should be reviewed. Again, consider using indexing engines like Lucene, refine the queries, consider using BloomFilters as contains-check caches to find needles in haystacks without going to the database, etc.
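The first two points can be combined in a paging loop: fetch size goes in as a provider hint (the "org.hibernate.fetchSize" hint is Hibernate-specific; EclipseLink uses "eclipselink.jdbc.fetch-size"), and setFirstResult/setMaxResults bound how many entities are in memory at once. A sketch, with the WorkLog entity name assumed from the question:

```java
import javax.persistence.EntityManager;
import javax.persistence.Query;
import java.util.List;

public class PagedReader {

    private static final int PAGE_SIZE = 100;

    // Processes WorkLog rows page by page, so at most PAGE_SIZE
    // entities are managed at any moment.
    public static void processAll(EntityManager em) {
        int first = 0;
        while (true) {
            Query q = em.createQuery("select w from WorkLog w order by w.id");
            // JDBC fetch size via a Hibernate-specific hint; there is no
            // portable Query.setFetchSize(int) in JPA.
            q.setHint("org.hibernate.fetchSize", PAGE_SIZE);
            List<?> page = q.setFirstResult(first)
                            .setMaxResults(PAGE_SIZE)
                            .getResultList();
            if (page.isEmpty()) {
                break;
            }
            for (Object workLog : page) {
                // handle one row; stream/flush instead of collecting
            }
            em.clear(); // detach the processed page to free memory
            first += PAGE_SIZE;
        }
    }
}
```

The `order by` clause matters: without a stable ordering, offset-based pages can skip or repeat rows between queries.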
Solution 4
Raw SQL shouldn't be considered a last resort. It should still be considered an option if you want to keep things "standard" on the JPA tier, but not on the database tier. JPA also has support for native queries where it will still do the mapping to standard entities for you.
However, if you have a large result set that cannot be processed in the database, then you really should just use plain JDBC as JPA (standard) does not support streaming of large sets of data.
It will be harder to port your application across different application servers if you use JPA-implementation-specific constructs, since the JPA engine is embedded in the application server and you may not have control over which JPA provider is being used.
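A minimal sketch of the plain-JDBC route, streaming rows instead of collecting them. The table and column names follow the question; the fetch size of 100 matches the suggestion elsewhere in this thread (note that some drivers, e.g. MySQL's, have their own quirks for enabling streaming):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class JdbcStreaming {

    // Streams the WorkLog table without materializing 10M rows in memory.
    public static void streamWorkLog(Connection conn) throws SQLException {
        String sql = "select workerId, hoursWorked from WorkLog";
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // Ask the driver to fetch rows in batches of 100 rather
            // than pulling the entire result set to the client.
            ps.setFetchSize(100);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    long workerId = rs.getLong("workerId");
                    long hours = rs.getLong("hoursWorked");
                    // aggregate or flush every N rows instead of collecting
                }
            }
        }
    }
}
```

Memory usage stays bounded by the fetch size, not by the table size, which is exactly what JPA (standard) cannot guarantee for large result sets.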
Roman
Updated on October 21, 2020
Comments
-
Roman over 3 years
I need to make my web app work with really huge datasets. At the moment I get either an OutOfMemoryException or output that takes 1-2 minutes to generate.
Let's put it simply and suppose that we have 2 tables in the DB: Worker and WorkLog, with about 1000 rows in the first one and 10,000,000 rows in the second one. The latter table has several fields, including 'workerId' and 'hoursWorked', among others. What we need is:
- the total hours worked by each user;
- a list of work periods for each user.
The most straightforward approach (IMO) for each task in plain SQL is:
1)

select Worker.name, sum(hoursWorked)
from Worker, WorkLog
where Worker.id = WorkLog.workerId
group by Worker.name;
-- results of this query should be transformed to Multimap<Worker, Long>

2)

select Worker.name, WorkLog.start, WorkLog.hoursWorked
from Worker, WorkLog
where Worker.id = WorkLog.workerId;
-- results of this query should be transformed to Multimap<Worker, Period>
-- if it was JDBC, it would be vital to set resultSet.setFetchSize(someSmallNumber), ~100
So, I have two questions:
- how to implement each of my approaches with JPA (or at least with Hibernate);
- how would you handle this problem (with JPA or Hibernate of course)?
-
Roman about 14 years
@Pascal Thivent: thanks for the answer! About volumes: I don't know the real volumes, I just specified the maximum (in my opinion, which is based on some knowledge of the domain). Maybe the real volume is 10-100 times less, and IMHO the solution for these volumes will be ok too.
-
nalply over 11 years
Optimized? Can you explain this?
-
Guido over 11 years
Do you know what they exactly mean by "stateless sessions are vulnerable to data aliasing effects"? Thanks.
-
Shiladittya Chakraborty over 8 years
The setHint method is showing as undefined.
-
phil294 about 7 years
This is in no way faster. In fact, it is extremely slow and way less performant than the usual use of EntityManager.
-
phil294 about 7 years
This. I found that executing a connection query manually (session.doWork or such) is in fact the fastest you can get.
-
Archimedes Trajano about 7 years
The standard EntityManager has no doWork operation.
-
phil294 about 7 years
Yes, that's why I wrote session, which you can get via entityManager.unwrap(Session.class);. Idk if that is bad programming style though. I guess one could also write a sessionFactory bean.
-
Archimedes Trajano about 7 years
That would likely be Hibernate-specific. There's no Session class in JPA. For the benefit of others, can you update your comment to state the full package of the Session class?
-
phil294 about 7 years
I cannot update the comment anymore, so: the IDE says org.hibernate.Session.
-
Darrell Teague almost 3 years
JDBC did a lot of things right, but... respectfully, it is not good for large result sets. While the client-side tweaking of setFetchSize(int) is at the core of making this functional in terms of controlling client-side memory, it does not inherently solve the problem. That is, when one is looking for a (few) records in hundreds of thousands or millions, full-text search engines provide superior performance and use-case support.
-
Darrell Teague almost 3 years
Depending on the resulting list size, this may crash and burn with a large data set. I have found that the general pattern of storing the data of record in a database, then using a search engine (Lucene et al.) on top of that, is superior. It provides full-text search pattern support, superior performance, and built-in paging without overburdening client memory requirements. In short, almost never just return (some) Collection in response to (arbitrary) user queries, in that it may be thousands (or millions?) of records in size.
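The entityManager.unwrap(Session.class) plus doWork approach discussed in the comments above can be sketched as follows. This is Hibernate-specific; the SQL and names are assumptions based on the question's schema:

```java
import org.hibernate.Session;

import javax.persistence.EntityManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class DoWorkExample {

    // Runs raw JDBC on the same connection the EntityManager manages.
    public static void sumHoursPerWorker(EntityManager em) {
        Session session = em.unwrap(Session.class); // Hibernate-specific
        session.doWork(connection -> {
            try (PreparedStatement ps = connection.prepareStatement(
                    "select workerId, sum(hoursWorked) "
                            + "from WorkLog group by workerId")) {
                ps.setFetchSize(100); // stream in small batches
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        // process one aggregated row at a time
                    }
                }
            }
        });
    }
}
```

Because the aggregation happens in the database, only ~1000 summary rows (one per worker) ever cross the wire, sidestepping the 10M-row problem entirely.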