Java Spring: How to efficiently read and save large amount of data from a CSV file?

This GitHub repo compares 5 different methods of batch inserting data. According to its author, using JdbcTemplate is the fastest (the claim is 500,000 records in 1.79 [+- 0.50] seconds). If you use JdbcTemplate with Spring Data, you'll need to create a custom repository; see this section in the docs for detailed instructions on that.
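
Here is a minimal sketch of what such a custom repository fragment could look like with JdbcTemplate.batchUpdate. The table name, column names, CsvRowBatchRepository and the CsvRow record are hypothetical placeholders (they are not from the linked repo); adapt them to your own schema.

    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    import org.springframework.jdbc.core.BatchPreparedStatementSetter;
    import org.springframework.jdbc.core.JdbcTemplate;

    // Batch-inserts parsed CSV rows in a single JDBC batch instead of
    // issuing one INSERT per row.
    public class CsvRowBatchRepository {

        private final JdbcTemplate jdbcTemplate;

        public CsvRowBatchRepository(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        public void insertAll(List<CsvRow> rows) {
            jdbcTemplate.batchUpdate(
                    "INSERT INTO csv_row (id, name, value) VALUES (?, ?, ?)",
                    new BatchPreparedStatementSetter() {
                        @Override
                        public void setValues(PreparedStatement ps, int i) throws SQLException {
                            CsvRow row = rows.get(i);
                            ps.setLong(1, row.id());
                            ps.setString(2, row.name());
                            ps.setString(3, row.value());
                        }

                        @Override
                        public int getBatchSize() {
                            return rows.size();
                        }
                    });
        }

        // Hypothetical row holder; replace with your own DTO or entity.
        public record CsvRow(long id, String name, String value) {}
    }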

Spring Data's CrudRepository has a save method that takes an Iterable (saveAll in newer versions), so you can use that too, although you'll have to time it to see how it performs against JdbcTemplate. Using Spring Data, the steps are as follows (taken from here with some edits):

  1. Add rewriteBatchedStatements=true to the end of the connection string.
  2. Make sure you use a generator that supports batching in your entity. E.g.

    @Id
    @GeneratedValue(generator = "generator")
    @GenericGenerator(name = "generator", strategy = "increment")
    
  3. Use the save(Iterable<S> entities) method of the CrudRepository (saveAll(Iterable<S>) in newer Spring Data versions) to save the data.

  4. Set the hibernate.jdbc.batch_size configuration property. A combined sketch of these steps follows this list.
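
Putting steps 1-4 together, a rough sketch could look like the following. MyEntity, MyEntityRepository and CsvImportService are hypothetical names, and the properties assume MySQL and the pre-Boot-3 javax.persistence namespace; adjust them to your stack.

    // application.properties (steps 1 and 4), assuming MySQL:
    //   spring.datasource.url=jdbc:mysql://localhost:3306/mydb?rewriteBatchedStatements=true
    //   spring.jpa.properties.hibernate.jdbc.batch_size=50
    //   spring.jpa.properties.hibernate.order_inserts=true

    // MyEntity.java -- step 2: an ID generator that does not disable batching
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.Id;

    import org.hibernate.annotations.GenericGenerator;

    @Entity
    public class MyEntity {

        @Id
        @GeneratedValue(generator = "generator")
        @GenericGenerator(name = "generator", strategy = "increment")
        private Long id;

        private String name;

        // getters and setters omitted for brevity
    }

    // MyEntityRepository.java
    import org.springframework.data.repository.CrudRepository;

    public interface MyEntityRepository extends CrudRepository<MyEntity, Long> {
    }

    // CsvImportService.java -- step 3: hand the whole collection to the
    // repository in one call so Hibernate can group the inserts into JDBC batches
    import java.util.List;

    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class CsvImportService {

        private final MyEntityRepository repository;

        public CsvImportService(MyEntityRepository repository) {
            this.repository = repository;
        }

        @Transactional
        public void importAll(List<MyEntity> entities) {
            repository.saveAll(entities); // save(Iterable) in older Spring Data versions
        }
    }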

The code for solution #2 is here.

As for using multiple threads, remember that writing to the same table from multiple threads may cause table-level contention and actually worsen results. You will have to try it and measure. How to write multithreaded code using Project Reactor is a separate topic that's out of scope here.

HTH.

Comments

  • V. Samma, almost 2 years

    I am developing a web application in Java Spring where I want the user to be able to upload a CSV file from the front-end, see the real-time progress of the import process, and, after importing, search individual entries from the imported data.

    The importing process would consist of actually uploading the file (sending it via REST API POST request) and then reading it and saving its contents to a database so the user would be able to search from this data.

    What would be the fastest way to save the data to the database? Just looping over the lines, creating a new class object for each and saving it via JpaRepository takes too much time: around 90s for 10,000 lines. I need to make it a lot faster; I need to add 200k rows in a reasonable amount of time.

    Side Notes:

    I saw an asynchronous approach using Reactor. This should be faster as it uses multiple threads, and the order of saving the rows basically isn't important (although the data has IDs in the CSV).

    Then I also saw Spring Batch jobs, but all of the examples use SQL. I am using repositories, so I'm not sure whether I can use it or whether it's the best approach.

  • V. Samma, almost 7 years
    Sorry, but I am creating a simple application currently. I am using JPA repository approach which saves the data to H2 database. I saw that it's possible to use SQLServer with Java Spring, but this solution seems a little too complicated for a problem like that. I just want to trigger the file processing after the user uploads it. A big data ETL solution may be a little too much at the moment.
  • V. Samma, almost 7 years
    Hi, I wasn't able to try your suggestions before now. But I think that before using the save method, I would have to read in the lines from the CSV file (I am currently using FileReader for reading and then CSVFormat.parse for parsing the lines into an Iterable<CSVRecord>), loop over them, create my entity class objects for each, and then add these to some kind of Iterable for saving to the database (a sketch of that flow appears after these comments). I mean, it is possible that the CSV reading, iterating over the rows and creating class instances is the time-consuming part. But I will try the JdbcTemplate first.
  • V. Samma, almost 7 years
    Okay, I tried JdbcTemplate and the results are very good. If by looping and saving to DB one by one it took around 0.86s for 100 rows, 4.3s for 1000 rows and 112s for 10k rows (I didn't even try larger files), then now I got those 10k rows read in from the file and saved to DB in 1.13s and the whole 200k row file took 9 seconds. Of course it's 4-5 times slower still than his example (I have an entity with 5 properties compared to his 1) but it's good enough for me :) Thanks a lot!
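
For reference, here is a minimal sketch of the parsing flow described in the comment above, assuming Apache Commons CSV and reusing the hypothetical CsvRowBatchRepository from the JdbcTemplate sketch earlier; the column positions are placeholders for your file's layout.

    import java.io.Reader;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    // Parses the uploaded CSV into row objects, then writes them in one
    // JDBC batch instead of saving row by row.
    public class CsvFileImporter {

        private final CsvRowBatchRepository repository;

        public CsvFileImporter(CsvRowBatchRepository repository) {
            this.repository = repository;
        }

        public void importFile(Path csvFile) throws Exception {
            List<CsvRowBatchRepository.CsvRow> rows = new ArrayList<>();
            try (Reader reader = Files.newBufferedReader(csvFile)) {
                for (CSVRecord record : CSVFormat.DEFAULT.parse(reader)) {
                    rows.add(new CsvRowBatchRepository.CsvRow(
                            Long.parseLong(record.get(0)), // id column
                            record.get(1),                 // name column
                            record.get(2)));               // value column
                }
            }
            repository.insertAll(rows); // single batch insert
        }
    }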