[SERVER-53667] High rate of journal flushes on secondary in 4.4
| Created: 08/Jan/21 | Updated: 19/Dec/22 | Resolved: 13/Jun/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Matthew Russotto |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | perf-escapes |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Sprint: | Execution Team 2021-01-25, Repl 2021-03-08, Repl 2021-03-22, Repl 2021-04-05, Repl 2022-03-21, Repl 2022-04-04, Repl 2022-04-18, Repl 2022-05-02, Repl 2022-05-16, Repl 2022-05-30, Repl 2022-06-13, Repl 2022-06-27 |
| Participants: |
| Case: | (copied to CRM) |
| Description |
|
A simple workload with a single thread doing updates as fast as possible sees the following:
This creates a much higher I/O requirement on the secondary in 4.4, and a much higher one on the secondary than on the primary. Two possible issues:
|
| Comments |
| Comment by Matthew Russotto [ 13/Jun/22 ] |
|
The issue of doing two journal flushes per batch was fixed in
| Comment by Lingzhi Deng [ 08/Mar/21 ] |
|
| Comment by Eric Lafontaine [ 08/Mar/21 ] |
|
Hi All, Just wondering if this will be fixed in Mongo 4.x or only in Mongo 5.x as per the bug resolution ticket. The difference is huge on our side if the fix is backward incompatible. Regards, Eric Lafontaine
| Comment by Bruce Lucas (Inactive) [ 05/Mar/21 ] |
|
jason.chan, it's not completely clear to me whether the increased number of batches is the real issue or whether changing the number of batches is the solution, so I'd like to keep this ticket open until we've investigated that question.
| Comment by Jason Chan [ 04/Mar/21 ] |
|
Filed
| Comment by Bruce Lucas (Inactive) [ 03/Mar/21 ] |
|
Here's the simple workload that was used to generate the numbers above. Setup:
Workload:
| Comment by Dianna Hohensee (Inactive) [ 27/Jan/21 ] |
|
I'm passing this back to replication to determine how to reduce (or parallelize) journal flush calls.
| Comment by Bruce Lucas (Inactive) [ 08/Jan/21 ] |
|
I assigned this to replication based on evin.roesle's suggestion, but PM-1274 was done by the execution team, so I'm not sure who should look into this.
| Comment by Bruce Lucas (Inactive) [ 08/Jan/21 ] |
|
I suspect this is a side effect of PM-1274: in 4.2 we journal before replicating, so the batch rate, and therefore the flush rate, on the secondary is determined by the flush rate on the primary. But in 4.4 we replicate before journaling, so the batch rate is much higher, often equal to the operation rate.
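
A rough way to reason about the suspicion above is a toy model of the secondary's flush rate. This is only a sketch, not server code: the function name, its parameters, and the `flushes_per_batch` knob are hypothetical, and it assumes the secondary issues a fixed number of journal flushes for every batch it applies.

```python
def secondary_flush_rate(op_rate, primary_flush_rate,
                         journal_before_replicating, flushes_per_batch=1):
    """Estimate journal flushes per second on a secondary (toy model only).

    Assumptions, not taken from the server code:
    - If the primary journals before replicating, new oplog entries become
      visible to the secondary at most once per primary flush, so the batch
      rate is capped by the primary's flush rate.
    - If the primary replicates before journaling, every operation can end up
      in its own batch, so the batch rate can reach the operation rate.
    - The secondary flushes `flushes_per_batch` times for each batch applied.
    """
    if op_rate < 0 or primary_flush_rate <= 0 or flushes_per_batch < 1:
        raise ValueError("rates must be positive and flushes_per_batch >= 1")

    if journal_before_replicating:
        # Batches cannot arrive faster than the primary makes entries durable.
        batch_rate = min(op_rate, primary_flush_rate)
    else:
        # Worst case: one batch per operation.
        batch_rate = op_rate

    return batch_rate * flushes_per_batch


# Hypothetical inputs for illustration only, not measurements from this ticket.
print(secondary_flush_rate(5000, 100, journal_before_replicating=True))
print(secondary_flush_rate(5000, 100, journal_before_replicating=False))
```

Under this model the secondary's flush rate tracks the primary's flush rate in the journal-before-replicate case and the operation rate otherwise, and any extra flushes per batch multiply the result. Actual rates depend on batching behavior that the model ignores.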