[SERVER-43417] Signal the flusher thread to flush instead of calling waitUntilDurable when waiting for {j:true} Created: 23/Sep/19 Updated: 06/Dec/22 Resolved: 05/Feb/20
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Lingzhi Deng | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Participants: | |
| Description |
_lastSyncMutex is one of the most contended mutexes when running with {j: true}. This optimization reduces the contention on _lastSyncMutex and avoids serializing {j: true} writers on journal waiting. It also reduces the sync-related I/O rate when there are many {j: true} writers, while avoiding significant delay in journal waiting.
| Comments |
| Comment by Dianna Hohensee (Inactive) [ 05/Feb/20 ] |
The work for this ticket has been done in
| Comment by Lingzhi Deng [ 02/Oct/19 ] |
After talking to daniel.gottlieb, we found that this part is problematic under today's implementation of the lastDurable opTime, because lastDurable can include oplog holes. Imagine two transactions, TXN1 and TXN2, with OpTimes 1 and 2 respectively, both in flight, where TXN2 commits before TXN1 does. At that point, TXN2 triggers a flush and the flusher sets lastDurable to OpTime 2 once it is done. Now OpTime 2 is durable/journaled but OpTime 1 is not. When TXN1 commits later, if we check whether lastDurable > OpTime 1 (and it is), we will mistakenly return without journaling OpTime 1. If the server then crashes and restarts, OpTime 1 is gone. This is only a problem for {w: 1, j: true}, because before PM-1274 journaling is implied for w > 1, as secondaries are not able to see those oplog entries due to oplog visibility rules. The takeaway is that today we cannot use a direct comparison with lastDurable to check whether an OpTime is already journaled, even when the OpTime in question is < lastDurable (equality is fine, I guess).

After PM-1274, I believe lastDurable will no longer be ahead of allCommitted (daniel.gottlieb to confirm), so the aforementioned approach would then work. However, this introduces a behavior/semantic change to {j: true}: if we pin lastDurable to allCommitted, {j: true} will have to wait for all concurrent transactions with earlier OpTimes to commit (i.e., no holes). This could mean higher latency for {j: true} writers.

An alternative is to use a counter as an indicator of whether a log flush has happened since "my" request: each log flush request (trigger) gets a number under lock for the next log flush, and the flusher takes a cutoff under lock before it actually flushes. It is like buying tickets for the next train, with a cutoff before the train leaves; a sketch of this idea follows below. Here is my POC using lastDurable with the future-based API introduced in
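As an illustration of the ticket/counter idea (a hypothetical sketch with made-up names; this is not the POC referenced above and not actual MongoDB code), the flusher takes a cutoff under the lock, and a waiter whose ticket is at or below the last completed cutoff knows its write has been flushed:

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>

class FlushTicketing {
public:
    // A writer calls this after its write has committed: it reserves a
    // "seat" on the next flush that has not yet taken its cutoff.
    uint64_t requestFlush() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _nextFlushNumber;
    }

    // Block until the flush covering our ticket has completed.
    void waitForFlush(uint64_t ticket) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return _lastCompletedFlush >= ticket; });
    }

    // Run by the single flusher thread for each flush cycle.
    template <typename FlushFn>
    void flushOnce(FlushFn flush) {
        uint64_t cutoff;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            // Close the doors: requests arriving after this point ride
            // the next train (the next flush cycle).
            cutoff = _nextFlushNumber++;
        }
        flush();  // the expensive journal flush, done outside the lock
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _lastCompletedFlush = cutoff;
        }
        _cv.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    uint64_t _nextFlushNumber = 1;
    uint64_t _lastCompletedFlush = 0;
};
```

Note the ordering requirement: a writer must take its ticket only after its write is eligible to be flushed, so that any flush whose cutoff is taken after the ticket is issued is guaranteed to cover it.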
I believe the performance gain came from reducing contention on _lastSyncMutex in waitUntilDurable. Under a high number of concurrent writers (with j: true), all writers block on the mutex even though some of them may find, once they are granted the mutex, that someone else has already synced. I suspect there is some kind of thundering-herd problem going on here. Also note that the window for two log flush requests to be satisfied by a single sync is fairly small (line 280 - line 289): if a waiter arrives after _lastSyncTime is incremented, it will have to wait for its turn to flush again, even though technically it was already synced by the previous flush. The sketch below illustrates this. Here is some profiling data. We can see that the average time taken in WiredTigerSessionCache::waitUntilDurable is significantly longer than the time it spends in __session_log_flush, which suggests contention on the mutex.
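For context, here is a simplified paraphrase of the contended pattern described above (illustrative only, reconstructed from my reading rather than quoted from the actual WiredTigerSessionCache source): every waiter serializes on _lastSyncMutex, and only a waiter whose snapshot of _lastSyncTime went stale while it queued gets to skip the flush:

```cpp
#include <atomic>
#include <cstdint>
#include <mutex>

std::mutex _lastSyncMutex;
std::atomic<uint32_t> _lastSyncTime{0};

void logFlush();  // the underlying journal sync, e.g. a log flush with sync=on

void waitUntilDurable() {
    uint32_t start = _lastSyncTime.load();

    std::unique_lock<std::mutex> lk(_lastSyncMutex);  // all waiters pile up here
    uint32_t current = _lastSyncTime.load();
    if (current != start) {
        // Someone else flushed between our snapshot and acquiring the
        // mutex, so our write is already durable: piggyback and return.
        return;
    }
    _lastSyncTime.store(current + 1);
    logFlush();  // expensive sync I/O performed while holding the mutex
}
```

The window in which a waiter can piggyback is only the gap between its snapshot and the increment; a waiter arriving just after the increment ends up holding the mutex for a full flush of its own, even though the previous flush already made its write durable.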
To sum up, my high-level idea is to have a single thread handle all flush requests, with waiters waiting for notifications (open question: keyed on lastDurable? a counter? something else?); a sketch of this follows below. CC: milkie, geert.bosch
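A minimal sketch of that high-level idea (hypothetical names, assuming a std::promise/std::shared_future notification scheme; this is not the actual MongoDB implementation): writers signal a dedicated flusher thread and wait on a future that is fulfilled by the first flush that starts after their request:

```cpp
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>
#include <utility>

class JournalFlusher {
public:
    JournalFlusher() : _thread([this] { _run(); }) {}

    // Writers call this after committing: it signals the flusher thread
    // and returns a future fulfilled once a covering flush completes.
    std::shared_future<void> triggerFlush() {
        std::lock_guard<std::mutex> lk(_mutex);
        _flushRequested = true;
        _cv.notify_one();
        return _nextFlushDone;
    }

private:
    void _run() {
        for (;;) {  // shutdown handling omitted for brevity
            std::promise<void> batchDone;
            {
                std::unique_lock<std::mutex> lk(_mutex);
                _cv.wait(lk, [&] { return _flushRequested; });
                _flushRequested = false;
                // Cut off the current batch; waiters arriving after this
                // point get the next flush's future instead.
                batchDone = std::move(_pending);
                _pending = std::promise<void>();
                _nextFlushDone = _pending.get_future().share();
            }
            _flushJournal();        // the expensive flush, mutex released
            batchDone.set_value();  // wake every waiter in the batch at once
        }
    }

    void _flushJournal() { /* placeholder for the storage engine's journal flush */ }

    std::mutex _mutex;
    std::condition_variable _cv;
    bool _flushRequested = false;
    std::promise<void> _pending;
    std::shared_future<void> _nextFlushDone = _pending.get_future().share();
    std::thread _thread;
};
```

With this shape, the _lastSyncMutex-style contention goes away: waiters touch the mutex only briefly to register, the flush itself runs on one thread without the mutex held, and a single set_value() notifies the whole batch instead of waking writers one mutex grant at a time.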
| Comment by Lingzhi Deng [ 26/Sep/19 ] |