[SERVER-47898] Advancing lastDurable irrespective of lastApplied Created: 01/May/20 Updated: 25/Jan/24 Resolved: 11/Dec/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 7.3.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Kelsey Schubert | Assignee: | Kishore Devireddy |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | PM-3489-Milestone-MiscImprovement-CP, former-quick-wins |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Repl 2020-08-24, Execution Team 2020-10-19, Repl 2023-12-11, Repl 2023-12-25 |
| Participants: | |
| Linked BF Score: | 33 |
| Description |
|
lastDurable is set asynchronously from lastApplied, so we can attempt to advance lastDurable beyond lastApplied. When that happens, we currently skip advancing lastDurable. This can increase latency for w:"majority" writes, which wait for lastDurable to advance on a majority. It also causes odd behavior: w:1, j:true writes return success, but durableOpTime in replSetGetStatus is not updated. When we attempt to advance lastDurable beyond lastApplied, we should advance lastDurable.
Original description: The comments surrounding this log line suggest that this is unexpected behavior: https://github.com/mongodb/mongo/blob/r4.4.0-rc3/src/mongo/db/repl/member_data.cpp#L163-L173 Observed on an internal cluster with a primary running 4.4.0-rc3 and secondaries on 4.2.1. Happy to increase logging level if it would be helpful to understand the behavior here. |
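The skip-versus-advance rule under discussion can be sketched as follows. This is a hedged illustration, not the actual member_data.cpp code: the OpTime struct, the term-then-timestamp ordering, and both method names are simplified stand-ins.

```cpp
#include <cassert>

// Simplified stand-ins for repl::OpTime and MemberData; the ordering by
// (term, timestamp) and all names here are illustrative assumptions.
struct OpTime {
    long long term = 0;
    long long ts = 0;

    bool operator<(const OpTime& o) const {
        return term < o.term || (term == o.term && ts < o.ts);
    }
};

struct MemberData {
    OpTime lastApplied;
    OpTime lastDurable;

    // Current behavior described above: refuse to move lastDurable past
    // lastApplied, leaving w:"majority" waiters blocked longer.
    bool setLastDurableSkipping(OpTime t) {
        if (lastApplied < t)
            return false;  // update skipped
        if (lastDurable < t)
            lastDurable = t;
        return true;
    }

    // Proposed behavior: advancing lastDurable past lastApplied drags
    // lastApplied forward with it.
    void setLastDurableAdvancing(OpTime t) {
        if (lastApplied < t)
            lastApplied = t;
        if (lastDurable < t)
            lastDurable = t;
    }
};
```

With lastApplied at {1, 5}, a journal flush reporting {1, 7} is dropped by the skipping variant (durableOpTime lags, as described above), while the advancing variant moves both optimes to {1, 7}.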
| Comments |
| Comment by Githook User [ 11/Dec/23 ] | ||
|
Author: {'name': 'Kishore Devireddy', 'email': 'kishore.devireddy@mongodb.com', 'username': 'kishorekrd'} Message: GitOrigin-RevId: 9c3067d69a4df40ef9cad9e284000e5b236c216e | ||
| Comment by Kishore Devireddy [ 11/Dec/23 ] | ||
| Comment by Githook User [ 21/Oct/20 ] | ||
|
Author: {'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'} Message: Revert " This reverts commit 2ebea0648a4bd2d31abb5251d66b2b7925bb06a8. | ||
| Comment by Githook User [ 15/Oct/20 ] | ||
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}Message: | ||
| Comment by Tess Avitabile (Inactive) [ 16/Sep/20 ] | ||
Yes, today we advance both lastDurable and lastApplied after the batch is applied. In the future, we could consider separating setting lastDurable and lastApplied in order to acknowledge writes faster, but we don't do this today. | ||
| Comment by Dianna Hohensee (Inactive) [ 16/Sep/20 ] | ||
|
The oplogTruncateAfterPoint updates are independent. We update the oplogTruncateAfterPoint with WT's all_durable timestamp prior to the journal flush. Then we try to update lastDurable after the journal flush, with the added criterion that lastDurable can only be updated with a real oplog timestamp, so we read an oplog entry <= all_durable with which to attempt to set repl's lastDurable. The oplogTruncateAfterPoint does not have the real-oplog-timestamp criterion, because WT's all_durable doesn't, and oplog truncation works with that – truncation used to be inclusive, so I assume the +1 recovery trick motivated this. | ||
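Sequentialized, the ordering described above looks roughly like this. The component names are invented stand-ins; only the step ordering mirrors the description.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

using Timestamp = long long;

// Invented stand-ins for the real components.
struct Node {
    std::vector<Timestamp> oplog;           // real oplog entry timestamps
    Timestamp oplogTruncateAfterPoint = 0;  // may be a non-oplog timestamp
    Timestamp lastDurable = 0;              // must be a real oplog timestamp
};

// Greatest real oplog entry at or before WT's all_durable, if any.
std::optional<Timestamp> latestEntryAtOrBefore(const std::vector<Timestamp>& oplog,
                                               Timestamp allDurable) {
    std::optional<Timestamp> best;
    for (Timestamp t : oplog)
        if (t <= allDurable && (!best || t > *best))
            best = t;
    return best;
}

void flushJournalAndAdvance(Node& node, Timestamp allDurable) {
    // 1. Before the flush: the truncate point takes the raw all_durable,
    //    which need not correspond to any oplog entry.
    node.oplogTruncateAfterPoint = allDurable;
    // 2. (journal flush happens here)
    // 3. After the flush: lastDurable only moves to a real oplog entry
    //    at or before all_durable, and never backward.
    if (auto entry = latestEntryAtOrBefore(node.oplog, allDurable))
        node.lastDurable = std::max(node.lastDurable, *entry);
}
```

For example, with oplog entries at 10, 20, 30 and an all_durable of 25, the truncate point takes 25 while lastDurable can only reach 20, the newest real entry at or before it.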
| Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ] | ||
How should I think about what lastDurable and lastApplied mean on secondaries, particularly when the system is in the state where oplog entries have been written but the data writes are incomplete? Are we saying that lastDurable won't be advanced during oplog application until after the batch completes? (Maybe that's already true somehow.) Are we going to bump lastApplied even though the data writes aren't done (and presumably rework how secondary reads at lastApplied need to work)? | ||
| Comment by Tess Avitabile (Inactive) [ 16/Sep/20 ] | ||
|
I think it would make sense to let lastDurable and lastApplied be independent and relax our rule that lastDurable <= lastApplied, particularly for the consensus arbiters use case. However, I'm not certain where we make this assumption, so relaxing it would be a larger investigation. Since in our current system it's safe to assume lastDurable <= lastApplied, it would be more straightforward to allow bumping lastDurable to also bump lastApplied, in order to address the concern with majority write performance. I'm not certain whether durableOpTime is one-to-one with oplogTruncateAfterPoint. | ||
| Comment by Daniel Gottlieb (Inactive) [ 16/Sep/20 ] | ||
|
Apologies if this is the wrong place to ask: When I think of the lastDurable value, the guarantee in my mental model is that all oplog timestamps at that time and earlier are accounted for and persisted on disk with an fsync in write-ahead-log files. When I think of lastApplied, I see that as some relaxed notion (due to holes on primaries) of what operations are reflected in the data that a (standard, non-oplog) reader can see. In our implementation, on primaries, the relationship makes sense: lastDurable advancing can also advance lastApplied, as the data write is atomic with the oplog write. But on secondaries, the relationship (based on my definition) is not necessarily so. The oplog writes happen first, and lastDurable can technically be advanced (in an effort to minimize the latency for moving the majority point) without applying the data writes. I'm not sure if our code does that. When I think of the arbiters-with-oplog project (maybe no longer a real project we plan on doing?), it's natural for those theoretical processes to have a lastDurable (they can vote on majority points) but not a lastApplied (they cannot service data reads, just oplog reads). Sorry for the words; here's an actual question: Why is the right thing to do to also advance lastApplied in step with lastDurable, as opposed to letting lastDurable be independent (i.e., relax our rules/invariants regarding lastDurable > lastApplied)? If independence isn't suitable, can someone provide a better way for me to think about what lastApplied and lastDurable represent? Third question related to:
Is the durableOpTime 1 to 1 with oplogTruncateAfterPoint? If we skip advancing ("hold back") the durableOpTime will we also hold back the oplogTruncateAfterPoint? | ||
| Comment by Tess Avitabile (Inactive) [ 15/Sep/20 ] | ||
|
Thanks for thinking this through with me! | ||
| Comment by Dianna Hohensee (Inactive) [ 15/Sep/20 ] | ||
|
Ah, I see finally. Thanks, that makes sense. | ||
| Comment by Tess Avitabile (Inactive) [ 15/Sep/20 ] | ||
|
The bug I described is harder to hit without making the optimization, but I believe it still exists today. In my example, I didn't assume that advancing lastDurable would advance lastApplied. | ||
| Comment by Dianna Hohensee (Inactive) [ 15/Sep/20 ] | ||
|
It looks like today we won't forward the lastDurable timestamp if the new value is ahead of lastApplied: that check is done here in MemberData. So I think there isn't a bug right now, but there would be if we tried to move lastApplied forward with lastDurable as an optimization? I think the original premise of this ticket was because of that odd log message in PRIMARY state that we didn't understand at the time. Moving lastApplied forward with lastDurable in state PRIMARY would be a new optimization because of the Replicate Before Journaling changes, but such an optimization is blocked by incorrect things happening in ROLLBACK if it were done for all states right now. | ||
| Comment by Tess Avitabile (Inactive) [ 15/Sep/20 ] | ||
|
Thanks for that analysis! If it is the case that a waitUntilDurable call can span the lifetime of a rollback, then even without the optimization in this ticket, I think we can get into the scenario I outlined above, which might violate internal invariants:
I think this scenario would be a bug (and from what you said, it sounds like a beginning-of-time bug for the JournalFlusher). I think I'll file this as a bug on Execution and mark this ticket as dependent on that one. | ||
| Comment by Dianna Hohensee (Inactive) [ 15/Sep/20 ] | ||
|
So the only scenario I can think of to cause the problem is the JournalFlusher thread simply running asynchronously in non-PRIMARY state: starting waitUntilDurable() and fetching a timestamp via the JournalListener (lastApplied), and then not finishing and not calling JournalListener::onDurable to update lastDurable until after rollback finishes / clears the lastApplied timestamp. I think this is the scenario you proposed back on August 14th. I think I was lost in the weeds of the other bugs and didn't analyze and answer properly. I don't think what you laid out is a bug, though. It seems like it should always have been the case that this can happen, as long as there's been a JournalFlusher and rollback. The JournalFlusher takes lastApplied, flushes, and tries to set lastDurable with it. I wouldn't be surprised if the logic to skip updating lastDurable when ahead of lastApplied was deliberately done to solve the rollback problem you found when not skipping. Looks like you documented this aspect a bit ago in the related ticket, though not because of rollback per se: JournalListener::onDurable led me down to MemberData::setLastDurableOpTimeAndWallTime. Rollback seems like the only scenario where forwarding the lastApplied along with lastDurable is unsafe. PRIMARY state should be safe, because we aren't manually manipulating the lastApplied value anywhere – that I know of. Secondary batch application should not run into any problems, because in that repl state the JournalFlusher only grabs lastApplied to update lastDurable with. The log Kelsey identifies in the first comment
seems to have shown up because PRIMARY state started getting ahead of lastApplied like only rollback used to do. You could theoretically try turning the JournalFlusher off for more of rollback: we turn it off briefly around storage engine recovery to a stable timestamp. SECONDARY state wouldn't ever try to move lastDurable ahead of lastApplied; rollback wouldn't try if the JournalFlusher were off, except for during the explicit waitUntilUnjournaledWritesDurable() calls I see in rs_rollback.cpp and replication_recovery.cpp. I'm not familiar with whether we try to forward the lastDurable timestamp in any other code paths than waitUntilDurable(). | ||
| Comment by Tess Avitabile (Inactive) [ 11/Sep/20 ] | ||
|
That's correct: lastDurableBeingSet would have a greater term than lastApplied but a lesser timestamp. This could happen if the node that rolled back was at a later timestamp than the new primary. For example:
This bug could occur if we allow asynchronous threads to set lastDurable while the node is in state SECONDARY. Do you think it's still possible for this to occur? | ||
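The term-versus-timestamp interaction in this scenario can be made concrete with a tiny sketch. It assumes optimes compare by term first and timestamp second; the struct is illustrative, not the server's actual OpTime class.

```cpp
#include <tuple>

// Illustrative optime ordered by term first, then timestamp; this
// mirrors the comparison assumed in the scenario above.
struct OpTime {
    long long term;
    long long ts;
};

bool operator<(const OpTime& a, const OpTime& b) {
    return std::tie(a.term, a.ts) < std::tie(b.term, b.ts);
}
```

With this ordering, a lastDurableBeingSet of {term: 2, ts: 50} compares greater than a post-rollback lastApplied of {term: 1, ts: 100}: the rolled-back node had written to a later timestamp, but the higher term dominates, which is how lastDurable can end up "ahead" despite a lesser timestamp.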
| Comment by Dianna Hohensee (Inactive) [ 10/Sep/20 ] | ||
|
tess.avitabile, could you explain what that invariant means? It seems like to get into the if-statement, the lastDurableBeingSet, opTimeAndWallTime.opTime, is != lastApplied. I don't get why
passes, but
would ever not pass after that? UPDATE | ||
| Comment by Tess Avitabile (Inactive) [ 09/Sep/20 ] | ||
|
dianna.hohensee, I rebased and tested my patch again, but I encountered the same invariant after rollback. Do you have any guess as to why this might happen? My understanding is that this is unexpected, since after | ||
| Comment by Tess Avitabile (Inactive) [ 31/Aug/20 ] | ||
|
Thanks, dianna.hohensee! I'll reschedule this ticket. | ||
| Comment by Dianna Hohensee (Inactive) [ 28/Aug/20 ] | ||
|
| ||
| Comment by Dianna Hohensee (Inactive) [ 24/Aug/20 ] | ||
|
Yes, I think that's possible, and is the bug scenario. A direct waitUntilDurable call that doesn't go through the JournalFlusher thread can get far enough to fetch a truncatePoint timestamp X (getToken), then stepdown runs, and then try to set the durableTimestamp with that prefetched timestamp X. | ||
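Sequentialized, that race can be sketched as below. The names (ReplState, fetchTruncatePointToken, trySetLastDurable) are invented for illustration; only the interleaving mirrors the scenario described.

```cpp
#include <cassert>

using Timestamp = long long;

// Invented, sequentialized model of the race: a timestamp is prefetched
// before the flush, state changes underneath, and the guarded setter
// then refuses the stale value.
struct ReplState {
    Timestamp lastApplied = 0;
    Timestamp lastDurable = 0;

    // Guard corresponding to the skip-if-ahead check in MemberData.
    bool trySetLastDurable(Timestamp t) {
        if (t > lastApplied)
            return false;  // stale prefetched value; skip the update
        if (t > lastDurable)
            lastDurable = t;
        return true;
    }
};

// Step 1 of a direct waitUntilDurable() call: prefetch the
// truncate-point timestamp (the "getToken" step).
Timestamp fetchTruncatePointToken(const ReplState& s) {
    return s.lastApplied;
}
```

The interleaving: the caller fetches the token at lastApplied = 100, a stepdown then moves lastApplied back to 60, and the subsequent trySetLastDurable(100) is rejected by the guard rather than moving lastDurable ahead of lastApplied.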
| Comment by Tess Avitabile (Inactive) [ 05/May/20 ] | ||
|
Skipping the update of lastDurable shouldn't mess up write concern confirmations or cause unusual behavior in what can be read. If I understand the code correctly, j:true writes will call setLastDurableOpTimeAndWallTime(), but they don't wait for lastDurable to advance. This will be a little odd, since durableOpTime in the replSetGetStatus response won't be updated promptly when a j:true write succeeds, but it means there is no performance impact for j:true writes. Since we don't have a readConcern level like "only show journaled data", there's no impact on what can be read. There would be a performance impact for w:"majority" writes, which do wait for the lastDurable to advance on a majority. I think I agree that we should do this on master only. | ||
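As a rough sketch of why a lagging lastDurable costs w:"majority" latency: such a write waits until a majority of members' lastDurable optimes reach its optime. The helper below is hypothetical, not the actual commit-point code in the server.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

using Timestamp = long long;

// Hypothetical helper: the newest timestamp durable on a majority of
// members, i.e. the point a w:"majority", j:true write effectively
// waits for.
Timestamp majorityDurablePoint(std::vector<Timestamp> lastDurables) {
    // Sort descending; the majority-th entry is durable on at least a
    // majority of members.
    std::sort(lastDurables.begin(), lastDurables.end(), std::greater<>());
    std::size_t majority = lastDurables.size() / 2 + 1;
    return lastDurables[majority - 1];
}
```

In a 3-node set with lastDurable values {30, 20, 10}, the majority-durable point is 20; if one member skips its update and sits at 5 instead of 20 (values {30, 5, 10}), the point drops to 10 and newer w:"majority" writes keep waiting.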
| Comment by Dianna Hohensee (Inactive) [ 05/May/20 ] | ||
|
Is skipping the update of the lastDurable timestamp going to mess up any write concern confirmations or cause unusual behavior in what can be read? waitUntilDurable, which updates lastDurable via the JournalListener, is typically called for j:true writes and otherwise on a periodic cadence by the JournalFlusher. In terms of correctness of oplog truncation via the oplogTruncateAfterPoint, that logic is safe because the oplogTruncateAfterPoint gets updated regardless of what happens with the lastDurable timestamp update attempt. And oplog visibility via oplogReadTimestamp is a separate system. | ||
| Comment by Judah Schvimer [ 05/May/20 ] | ||
|
I think leaving 4.4 as is and making the change on 4.5+, which will have a lot of time to burn in, would be a worthwhile perf gain. | ||
| Comment by Tess Avitabile (Inactive) [ 05/May/20 ] | ||
|
I noticed that we don't set lastDurable forward if it would move ahead of lastApplied, so we maintain that lastDurable is never ahead of lastApplied. This means there's no concern about sending heartbeats that have lastDurable ahead of lastApplied. We could consider changing the code to advance both lastApplied and lastDurable when we try to advance lastDurable ahead of lastApplied, or we could leave the code as is. It's possible there would be a performance benefit to w:"majority" writes if we make the change. My inclination is to leave the code as is, at least on the 4.4 branch. dianna.hohensee, judah.schvimer, siyuan.zhou, what are your thoughts? | ||
| Comment by Tess Avitabile (Inactive) [ 04/May/20 ] | ||
|
Thanks, dianna.hohensee! Yes, the lastApplied optime is advanced through on-commit handling, so it can lag behind what is written in the oplog. Since this is expected, we should probably either (1) Remove the log line, or (2) Update the lastApplied when we update the lastDurable. I think (2) is preferable, since having lastApplied behind lastDurable will lead us to log surprising heartbeat messages, and I'm not sure if we have any other assumptions that lastApplied should be ahead of lastDurable. | ||
| Comment by Dianna Hohensee (Inactive) [ 04/May/20 ] | ||
|
It sounds possible to me that the durable timestamp can move ahead of the last applied timestamp. I am not familiar with how the applied timestamp is updated – presumably some kind of on-commit handling? – but the durable timestamp is being updated independently in mode PRIMARY by asynchronous Replicate Before Journaling logic. Flushing the journal will first fetch the OpTimeAndWallTime of the oplog entry associated with the WT all_durable timestamp (the no-oplog-holes point in memory), then do the journal flush, then use the previously fetched OpTimeAndWallTime to update the repl durable timestamp. The durable timestamp updates for journaling are dependent only on the WT all_durable timestamp. In mode SECONDARY, the oplog applier thread explicitly triggers journal flushes, but moves on without waiting. | ||
| Comment by Tess Avitabile (Inactive) [ 04/May/20 ] | ||
|
That does sound concerning. It would be great to see logs with increased log level. dianna.hohensee, do you think this could be related to the Replicate Before Journaling project? matthew.russotto, we still expect it to be an error for the lastDurable optime to advance beyond the lastApplied optime, is that right? | ||
| Comment by Kelsey Schubert [ 01/May/20 ] | ||
|
Here's a sample log line; it happens every few seconds:
|