[SERVER-50140] Initial sync cannot survive unclean restart of the sync source Created: 06/Aug/20  Updated: 29/Oct/23  Resolved: 12/Aug/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.4.0
Fix Version/s: 4.4.1, 4.7.0

Type: Bug
Priority: Critical - P2
Reporter: Suganthi Mani
Assignee: Lingzhi Deng
Resolution: Fixed
Votes: 0
Labels: KP44
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-49159 Return NotPrimaryOrSecondary if curre... Closed
Related
related to SERVER-50251 Re-enable test added in SERVER-50140 ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Repl 2020-08-24
Participants:

 Description   

Investigate the behavior of resumable initial sync when the sync source restarts after an unclean shutdown.



 Comments   
Comment by Githook User [ 13/Aug/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-50140: Initial sync cannot survive unclean restart of the sync source

(cherry picked from commit 06169f718a7aec04f952979ffa6590e4334dea5a)
(cherry picked from commit d598744856190df68f36c3f7a88decd30fa8e912)
Branch: v4.4
https://github.com/mongodb/mongo/commit/9b68f8ceab4752337688b8feebf23976094bac0f

Comment by Githook User [ 12/Aug/20 ]

Author:

{'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}

Message: SERVER-50140: Initial sync cannot survive unclean restart of the sync source
Branch: master
https://github.com/mongodb/mongo/commit/d598744856190df68f36c3f7a88decd30fa8e912

Comment by Suganthi Mani [ 07/Aug/20 ]

Just adding to Eric's points: to my understanding, after the durable history project, both replication startup recovery and rollback will execute the WT method rollback_to_stable().

Comment by Eric Milkie [ 07/Aug/20 ]

One concern with incrementing the rollback ID would be that it would make metrics no longer accurate about how many replication rollbacks we had.

I'm not really sure why people would care whether a rollback was "replication rollback" or not. The method used to do rollback is almost the same – recover-to-timestamp is used both for replication rollback and at startup. So in some sense, it would be more accurate to count unclean shutdowns as a "rollback".

Indeed, after we switched table logging to be almost all "unlogged", that had the effect of converting a lot of situations that would have been "replication rollback" into ones where the rollback occurred at startup recovery instead, so the metrics are already biased as of that change.

Comment by Judah Schvimer [ 07/Aug/20 ]

AFAIK, the rollback ID is only used for initial sync, sync source selection post rollback via refetch, and downstream implementations of initial sync.

One concern with incrementing the rollback ID would be that it would make metrics no longer accurate about how many replication rollbacks we had. We could fix that concern by adding a metric specifically for replication rollback counting.

I would expect downstream implementations of initial sync to want the same fix as this bug.

Sync source selection post rollback via refetch would also want the same fix. That uses the rollback ID to confirm that a potential sync source still has a node's minValid oplog entry. Since this is a way for a potential sync source to lose a minValid oplog entry, it would make sense to bump the rollback ID in this case.

Comment by Suganthi Mani [ 07/Aug/20 ]

Even if it's not the rollback ID, we could have a separate ID to track unclean shutdowns. But I am also curious how we can distinguish between a clean shutdown-restart and an unclean shutdown-restart. Is it something by using this logic? Or is it possible to use "oplogTruncateAfterPoint" to get that information, since this field is cleared during a clean shutdown on a primary and is set on secondaries only if the secondary crashes while writing oplog entries for a batch?

Comment by Matthew Russotto [ 07/Aug/20 ]

Does it make sense to have a node increment its rollback ID during restart from unclean shutdown? Since it could have effectively "rolled back" even if no formal rollback happened? Or would that have undesirable side effects?
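
As a rough illustration of the idea above, here is a toy model rather than MongoDB's implementation. `SyncSource`, `InitialSyncAttempt`, and all of their methods are hypothetical names invented for this sketch. It assumes the syncing node records the source's rollback ID when the attempt starts and compares it again later, and that the source bumps the ID on restart after an unclean shutdown as well as on a replication rollback.

```python
from dataclasses import dataclass


@dataclass
class SyncSource:
    """Toy stand-in for a sync source node (hypothetical, not MongoDB code)."""

    rollback_id: int = 1

    def replication_rollback(self) -> None:
        # A formal replication rollback always bumps the rollback ID.
        self.rollback_id += 1

    def restart(self, clean: bool) -> None:
        # Assumed behavior under the proposal: an unclean restart may have
        # lost oplog entries at startup recovery, so it is treated like a
        # rollback. A clean restart leaves the ID unchanged.
        if not clean:
            self.rollback_id += 1


class InitialSyncAttempt:
    """Toy stand-in for one initial sync attempt on the syncing node."""

    def __init__(self, source: SyncSource) -> None:
        self.source = source
        # Remember the rollback ID observed when the attempt began.
        self.rbid_at_start = source.rollback_id

    def source_still_consistent(self) -> bool:
        # If the ID changed, the source may no longer hold the data the
        # attempt already copied, so the attempt should not be resumed.
        return self.source.rollback_id == self.rbid_at_start
```

In this model, an attempt survives `restart(clean=True)` on the source but reports an inconsistency after `restart(clean=False)`, which is the distinction the question is about.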

Comment by Eric Milkie [ 06/Aug/20 ]

It sounds like this situation would be hard to hit in practice, since you would need to have a primary node undergo an unclean shutdown, restart, and then NOT undergo any rollback.
