Project has Spring Boot with JPA. We have a table vehicle with 1m records. Table has an indexed field type.
We have a use case, where we want to fetch all the records by type. For each type, we get all the vehicle records, then for next type, then for next and so on.
Since there are 1m records, we are fetching the records for each type with the batch size of 1000. We also apply a filter with type column.
VehicleRepository.java
Page<VehicleRecord> findByType(String type, Pageable pageable);
VehicleService.java
for (String type: vehicleTypes) {
Pageable pageable = PageRequest.of(0, 1000, Sort.by("updated_at").ascending());
Page<VehicleRecord> vehicles = null;
do {
vehicles = vehicleRepository.findByType(type, pageable);
// do something with vehicles
pageable = pageable.next();
} while (vehicles.hasNext());
}
For the sake of understanding, assume there are 5 types of records:
- A - 0 vehicles
- B - 100000 vehicles
- C - 0 vehicles
- D - 0 vehicles
- E - 0 vehicles
Problem:
In this, when fetching data for A, findByType completes in < 100ms. which is good.
When fetching for B though, first fetch with LIMIT 1000 OFFSET 0 takes ~200ms. But it is downhill from here, with increase in OFFSET value, time also increases. By the time LIMIT is 1000 and OFFSET 90000, findByType takes 6000-7000ms.
What is further confusing is that after data is fetched for B, rest of the types (C, D & E) each take 3000-4000ms, when they have 0 data.
I am not sure what is happening here. I read somewhere that it is because of high OFFSET value that the method is taking so much time. But that does not explain why the method would take so much time for C, D, and E.
Any inputs would be helpful. Thanks
Edit 1: Profiling results (Visual VM)
- SQL queries are executing normally, they take hardly 150-200ms, even for high offset values.
- This is unexpected, that the vehicles collection keeps adding vehicle records to it after each iteration (observed this in memory section of profiler). I expected "live object" count to remain to max 1000, because that's what our limit size is. But after each iteration, it keeps adding 1000 records to it. Even after performing manual GC from profiler, it won't free that memory, until all the iterations for for loop are complete.
CodePudding user response:
Chris said something correct: It could be that you application doesn't know where it left of last time querying for "B", what happens is (pagesize 1000):
You request page 0: Find matching entries and add them to the result set. Once the result set has a size of 1000, return it.
You request page 1: Find (!) and skip the first 1000 matching entries. Take the matching entries 1001 to 2000, add them to the result set and ignore it.
You request page 2: Find (!) and skip the first 2000 matching entries. Take the matching entries 2001 to 3000, add them to the result set and ignore it.
... and so on.
So basically the database executes the query multiple times, each time increasing the total query time because the database doesn't know where it left of last time. A solution would be to somehow pass the last-fetched id (primary key) to the query and start from there (... AND id > :id). Maybe you
I compiled a sample application to test your findings. In my vehicle table are currently ~723k entries. Database and application ran on my local machine (pagesize 1000):
- Querying for A (0 entries) took about 10ms.
- Querying for B (0 entries) took about 2200ms.
- Querying for C (0 entries) took about 10ms.
- Querying for D (0 entries) took about 10ms.
- Querying for E (0 entries) took about 10ms.
So, I cannot reproduce your problem. Maybe you can cut down your code to be as simple as possible and share it with us (or find the bottleneck yourself).
I uploaded mine to my Github repository.
The results were:
A: 185ms
B: 2139ms
B: 2007ms
B: 1863ms
B: 1930ms
C: 2ms
D: 3ms
E: 2ms
A: 1ms
B: 2020ms
B: 2044ms
B: 2006ms
B: 2053ms
B: .. same average values all over
And to mention one thing, if you have lots of records in your database but only a small amount of different types, the index won't help much. Some SQL optimizer might ignore the index and perform a full table scan as the index cardinality could be too low.
CodePudding user response:
From the comments, the issue does not seem to be with the paging query itself, but with how it is used and the amount of data affects the JVM. The code snippet provided suggests you are calling vehicleRepository.findByType(type, pageable); multiple times within the same VehicleService method, with the implication being they would be all within the same EntityManager/transactional context. JPA requires EntityManager contexts cache every entity read through them, so that they can monitor and serialize any changes made to the database. If you are reading in large batches of entities, that builds up - EntityManagers are meant to represent units of work and not be long lived like that.
Solution(s) would be to break up each 'batch' into its own transactional context, with a call for each vehicle type.
Alternatively, you can get a handle on the EntityManager instance. After processing your entities, call EntityManager.clear() to have it release its references to all managed entities within it, and allow them to be garbage collected if you don't have application references to them.
