Multi threading database reading

10,341

Solution 1

Network

First of all, since using rowid and rownum is vendor-lock anyway, you should consider using database stored routines. It could significantly reduce overhead of transmitting data from database to the application server (especially if they are on different machines and connected through network).

Considering that you have 80 million records to transmit, that could be the best performance boost for you, though it depends on kind of work your threads do.

Obviously increasing bandwidth would also help to solve networking issues.

Disk performance

Before making changes in code check the hard drive load while tasks running, perhaps it just can't handle that much I/O (10 threads reading simultaneously).

Migrating to SSD/RAID or clustering database might solve the issue. While changing the way you access database won't in that case.

Multithreading could solve CPU problems, but databases mostly depend on disk system.

Rownum

There are a couple of problems you might face if you will be implementing it using rowid and rownum.

1) rownum is generated on the fly for each query's results. So if query doesn't have explicit sorting and it is possible that some record have different rownum every time you run query.

For example you run it first time and get results like this:

some_column | rownum
____________|________
     A      |    1
     B      |    2
     C      |    3

then you run it second time, since you don't have explicit sorting, dbms (for some reason known to itself) decides to return results like this:

some_column | rownum
____________|________
     C      |    1
     A      |    2
     B      |    3

2) point 1 also implies that if you will be filtering results on rownum it will generate temporary table with ALL results and then filter it

So rownum is not a good choice for splitting results. While rowid seemed better, it has some issues too.

Rowid

If you look at the ROWID description you may notice that "rowid value uniquely identifies a row in the database".

Because of that and the fact that when you delete a row you have a "hole" in rowid sequence, rowids may be distributed not equally among table records.

So for example if you have three threads and each fetching 1'000'000 rowids, it is possible that one will get 1'000'000 records and other two 1 record each. So one will be overwhelmed, while two other starving.

It might be not a big deal in your case, though it very well might be the problem you are facing currently with primary key pattern.

Or if you first fetch all rowids in dispatcher and then divide them equally (like peter.petrov suggested) that could do the thing, though fetching 80 million ids still sounds like a lot, I think it would be better to do the splitting with one sql-query that returns borders of chunks.

Or you might solve that problem by giving low amount of rowids per task and using Fork-Join framework introduced in Java 7, however it should be used carefully.

Also obvious point: both rownum and rowid are not portable across databases.

So it is much better to have your own "sharding" column but then you will have to make sure yourself that it splits records in more or less equal chunks.


Also keep in mind that if you are going to do it in several threads it is important to check what locking mode database uses, perhaps it just locks the table for every access, then multithreading is pointless.

As others suggested, you'd better first find what is the main reason of low performance (network, disk, database locking, thread starvation or maybe you just have suboptimal queries - check the query plans).

Solution 2

  1. Create a dispatching thread which reads the PKs (the IDs) of N rows. Here you can do some sort of caching - read N=1000 rows, give them to Worker1, read next N=1000 rows, give them to Worker2, etc. This way, you don't need to keep more than N=1000 IDs (PKs) in-memory in the dispatcher thread. Once you pass the work (the work is these N=1000 IDs) to the Worker thread, you dispose them in the dispatcher thread (no need to keep them).

  2. Each worker thread takes its N (e.g. 1000) PKs/IDs and using them gets the rows from the DBs. Make sure here that you use rowlock (T-SQL) or its equivalent if you're not using SQL Server. This way, the threads will not get into each other's way. So worker reads N rows from the DB and processes them. Once complete it may signal the dispatcher (something like "I am done" event).

This is the initial idea that comes to my mind. I guess it could be refined further if you think some more on it.

Solution 3

As noted in the comments, multithreading might not help and may even make matters worse.

Standard alternatives for any query are:

  1. See what the query plan is and see if rejigging the SQL can improve the query plan.
  2. Do more processing on the database side - but since you are doing a SELECT COUNT... there is nothing more you can do here!
  3. See if you can rework the query to do incremental computation based on new or updated rows since the last run, instead of querying over all old the data every time.

None of these are guaranteed to succeed. It would depend on what you are trying to do.

Share:
10,341
Malatesh
Author by

Malatesh

I am enthusiastic programmer, having 11+ experience in software industry. Product development expertise with technologies. Core Java, J2EE, Spring, Spring Boot, JavaScript, Elasticsearch, Kibana, Microservices, Docker containers, Data Structures, Algorithms. Oracle Database Linux Domains: Storage Backup Management, Finance, E-Commerce, CRM Vmware virtulization etc.

Updated on June 04, 2022

Comments

  • Malatesh
    Malatesh almost 2 years

    In our Java application I have requirement to read the around 80 million records from oracle database. I am trying to redesign the multithreading program for this. Currently we are using Java 5 thread pooling with 10 threads reading the data base in parallel based on a primary key pattern. Each thread will read different pattern like 001* and 002*.

    How can I improve the performance of this program? I am thinking of design pattern, having lead thread to read the database and delegate the processing to child threads. In our existing design, different threads access the table by having 10 jdbc connections. With the new approach I will have only one thread reading the table.

    We have different select statement for each thread like select count(*) from "+table+ " where id like '"+format + "%'".

    Ok sounds reading rowid pattern good, but is it good to read by rowid or rownum ?

    Can any body please pros and cons of new approach and is there other way we can implement it.