[SERVER-43417] Signal the flusher thread to flush instead of calling waitUntilDurable when waiting for {j:true} Created: 23/Sep/19  Updated: 06/Dec/22  Resolved: 05/Feb/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-43135 Introduce a future-based API for wait... Closed
Duplicate
is duplicated by SERVER-45665 Make JournalFlusher flush on command ... Closed
Related
is related to SERVER-43658 Only call waitUntilDurable inline for... Closed
Assigned Teams:
Replication
Participants:

 Description   

_lastSyncMutex is one of the most contended mutexes when running with {j: true}. This optimization reduces contention on _lastSyncMutex and avoids serializing {j: true} writers on journal waiting. It also reduces the sync-related I/O rate when there are many {j: true} writers, without adding significant delay to journal waiting.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 05/Feb/20 ]

The work for this ticket has been done in SERVER-45665. Closing.

Comment by Lingzhi Deng [ 02/Oct/19 ]

signal a flusher thread and wait for last durable opTime

After talking to daniel.gottlieb, we found that this part is problematic under today's implementation of the lastDurable opTime, because lastDurable can include oplog holes. Imagine two transactions, TXN1 and TXN2, with OpTimes 1 and 2 respectively, both in flight, where TXN2 commits before TXN1 does. At that point TXN2 triggers a flush, and the flusher sets lastDurable to OpTime 2 once it is done, so OpTime 2 is durable/journaled but OpTime 1 is not. When TXN1 commits later and we check whether lastDurable > OpTime 1, the check passes, so TXN1 mistakenly returns without journaling OpTime 1. If the server then crashes and restarts, OpTime 1 is gone. This is only a problem for {w: 1, j: true} because, before PM-1274, journaling is implied for w > 1, as secondaries are not able to see those oplog entries due to oplog visibility rules.

The takeaway is that today, we cannot do a direct comparison with lastDurable to check whether an OpTime is already journaled, even when the OpTime in question is < lastDurable (equality is fine, I guess).

After PM-1274, I believe lastDurable will no longer be ahead of allCommitted (daniel.gottlieb to confirm), so the aforementioned approach would work. However, this introduces a behavior/semantic change to {j: true}: if we pin lastDurable to allCommitted, a {j: true} writer will have to wait for all concurrent transactions with earlier OpTimes to commit (i.e. no holes), which could mean higher latency for {j: true} writers.

An alternative is to use a counter as an indicator of whether a log flush has happened since "my" request. Each log flush request (trigger) gets a ticket number, under a lock, for the next log flush, and the flusher takes a cutoff, under the lock, just before it actually flushes. It is like buying a ticket for the next train, with a cutoff right before the train leaves.
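
A minimal sketch of this ticket/cutoff counter, assuming a plain std::mutex/std::condition_variable implementation; all names here are hypothetical, not the server's code:

#include <algorithm>
#include <condition_variable>
#include <cstdint>
#include <mutex>

class FlushTicketCounter {
public:
    // Writer side: buy a ticket for the next log flush.
    uint64_t requestFlush() {
        std::lock_guard<std::mutex> lk(_mutex);
        return ++_lastRequested;
    }

    // Flusher side: take the cutoff under the lock just before flushing.
    // Every ticket <= the returned value is covered by this flush.
    uint64_t beginFlush() {
        std::lock_guard<std::mutex> lk(_mutex);
        return _lastRequested;
    }

    // Flusher side: record that the flush up to 'cutoff' has completed.
    void completeFlush(uint64_t cutoff) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _lastFlushed = std::max(_lastFlushed, cutoff);
        }
        _cv.notify_all();
    }

    // Writer side: block until a flush covering our ticket has completed.
    void waitForFlush(uint64_t ticket) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return _lastFlushed >= ticket; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    uint64_t _lastRequested = 0;  // last ticket handed out ("tickets sold")
    uint64_t _lastFlushed = 0;    // highest ticket covered by a completed flush
};

A waiter calls requestFlush() and then waitForFlush(ticket); the flusher calls beginFlush() before the sync and completeFlush(cutoff) after it, so waiters never block on the flush itself, only on the notification.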

Here is my POC using lastDurable with the future-based API introduced in SERVER-43135 (see wiredtiger_kv_engine.cpp and write_concern.cpp), and the performance gain for {w: majority}, which does not have the problem mentioned above:

{w: majority}      1      32      64     256     1024
before           272    3424    5482   10529    10038
after            271    3513    5901   12605    17737

I believe the performance gain comes from reducing contention on _lastSyncMutex in waitUntilDurable. Under a high number of concurrent writers (with j: true), all writers block on the mutex, even though some of them will find, once they are granted the mutex, that someone else has already synced for them. I suspect there is some kind of thundering-herd problem going on here. Also note that the window for two log flush requests to be satisfied by a single sync is fairly small (line 280 - line 289): if a waiter arrives after _lastSyncTime is incremented, it has to wait for its turn to flush again, even though technically its write was already synced by the previous flush. A simplified sketch of this pattern is below.
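
For illustration only, this is a simplified sketch of the pattern described above, not the actual WiredTigerSessionCache::waitUntilDurable code (names are borrowed from the comment):

#include <atomic>
#include <mutex>

struct SyncSketch {
    std::atomic<unsigned> _lastSyncTime{0};
    std::mutex _lastSyncMutex;

    void waitUntilDurable() {
        const unsigned start = _lastSyncTime.load();     // snapshot before queueing on the mutex
        std::lock_guard<std::mutex> lk(_lastSyncMutex);  // every {j: true} waiter serializes here
        if (_lastSyncTime.load() != start) {
            // Someone else flushed while we waited for the mutex; we are covered.
            return;
        }
        _lastSyncTime.fetch_add(1);
        flushLog();  // the expensive sync-related I/O, done while holding the mutex
    }

    void flushLog() { /* stands in for __session_log_flush */ }
};

The only way to piggyback on someone else's flush is to read _lastSyncTime before that flush bumps it and then be granted the mutex afterwards; any waiter arriving later flushes again even though its write is already durable.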

Here is some profiling data. We can see that the average time taken in WiredTigerSessionCache::waitUntilDurable is significantly longer than the average time spent in __session_log_flush, which suggests contention on the mutex.

Func                                        Min (us)   Max (us)   Avg (us)    Count    Total (us)
__session_log_flush                                5      46198       1598    11324      18099227
WiredTigerSessionCache::waitUntilDurable          13     148483      10974   357477    3923132716

To sum up, my high-level idea is to have a single thread handle all flush requests, with waiters waiting for notification (open question: notify based on lastDurable, a counter, or something else?). A minimal sketch of the counter variant follows.
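
Sketch of the single-flusher design under the counter variant above; everything here is hypothetical and simplified, and flushJournal() merely stands in for the real storage-engine sync:

#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

class JournalFlusher {
public:
    JournalFlusher() : _thread([this] { _run(); }) {}

    ~JournalFlusher() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _shutdown = true;
        }
        _cv.notify_all();
        _thread.join();
    }

    // Writer side: signal the flusher and block until a flush covering us completes.
    void flushAndWait() {
        std::unique_lock<std::mutex> lk(_mutex);
        const uint64_t ticket = ++_lastRequested;
        _cv.notify_all();  // wake the flusher thread
        _cv.wait(lk, [&] { return _lastFlushed >= ticket; });
    }

private:
    void _run() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_shutdown) {
            _cv.wait(lk, [&] { return _shutdown || _lastFlushed < _lastRequested; });
            if (_shutdown)
                break;
            const uint64_t cutoff = _lastRequested;  // take the cutoff under the lock
            lk.unlock();
            flushJournal();  // only this thread does the sync-related I/O
            lk.lock();
            _lastFlushed = cutoff;
            _cv.notify_all();  // wake every writer covered by this flush
        }
    }

    void flushJournal() { /* call into the storage engine here */ }

    std::mutex _mutex;
    std::condition_variable _cv;
    bool _shutdown = false;
    uint64_t _lastRequested = 0;  // counter variant: "tickets sold"
    uint64_t _lastFlushed = 0;    // highest ticket covered by a completed flush
    std::thread _thread;          // declared last so the state above is initialized first
};

However many concurrent {j: true} writers call flushAndWait(), the flusher batches them into one sync per pass, and contention is limited to the short bookkeeping critical sections rather than the flush itself.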

CC: milkie, geert.bosch

Comment by Lingzhi Deng [ 26/Sep/19 ]

SERVER-41392 and PM-1274 will change how the oplog manager works, but it would still be useful in the future to have the ability to signal a flusher thread and wait for the last durable opTime. I am taking this out of PM-1456 and will revisit it once PM-1274 is done.
