How to handle a large set of data using Spring Data Repositories?

17,064

Solution 1

We have the classical consulting answer here: it depends. As the implementation of the method is store specific, we depend on the underlying store API. In case of JPA there's no chance to provide streaming access as ….getResultList() returns a List. Hence we also expose the List to the client as especially JPA developers might be used to working with lists. So for JPA the only option is using the pagination API.

For a store like Neo4j we support the streaming access as the repositories return Iterable on CRUD methods as well as on the execution of finder methods.

Solution 2

The implementation of findAll() simply loads the entire list of all entities into memory. Its Iterable return type doesn't imply that it implements some sort of database level cursor handling.

On the other hand your custom myQuery(Pageable) method will only load one page worth of entities, because the generated implementation honours its Pageable parameter. You can declare its return type either as Page or List. In the latter case you still receive the same (restricted) number of entities, but not the metadata that a Page would additionally carry.

So you basically did the right thing to avoid loading all entities into memory in your custom query.

Please review the related documentation here.

Solution 3

I think what you are looking for is Spring Data JPA Stream. It brings a significant performance boost to data fetching particularly in databases with millions of record. In your case you have several options which you can consider

  1. Pull all data once in memory
  2. Use pagination and read pages each time
  3. Use something like Apache Spark
  4. Streaming data using Spring Data JPA

In order to make Spring Data JPA Stream to work, we need to modify our MyRepository to return Stream<MyEntity> like this:

public interface MyRepository extends PagingAndSortingRepository<MyEntity, Integer> {
    @QueryHints(value = {
        @QueryHint(name = HINT_CACHEABLE, value = "false"),
        @QueryHint(name = READ_ONLY, value = "true")
    })
    @Query(value="SELECT * ...")
    Stream<MyEntity> myQuery();
}

In this example, we disable second level caching and hint Hibernate that the entities will be read only. If your requirement is different, make sure to change those settings accordingly for your requirements.

Share:
17,064
José Ricardo
Author by

José Ricardo

Computer Scientist

Updated on June 03, 2022

Comments

  • José Ricardo
    José Ricardo almost 2 years

    I have a large table that I'd like to access via a Spring Data Repository.

    Currently, I'm trying to extend the PagingAndSortingRepository interface but it seems I can only define methods that return lists, eg.:

    public interface MyRepository extends 
            PagingAndSortingRepository<MyEntity, Integer>
    {
      @Query(value="SELECT * ...")
      List<MyEntity> myQuery(Pageable p);
    }
    

    On the other hand, the findAll() method that comes with PagingAndSortingRepository returns an Iterable (and I suppose that the data is not loaded into memory).

    Is it possible to define custom queries that also return Iterable and/or don't load all the data into memory at once?

    Are there any alternatives for handling large tables?