[SERVER-27403] Consider term and rbid when validating the proposed sync source Created: 13/Dec/16  Updated: 29/Aug/18  Resolved: 08/Mar/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.2.13, 3.4.4, 3.5.5

Type: Bug Priority: Critical - P2
Reporter: Spencer T Brody Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: None

Issue Links:
depends on SERVER-27050 Ensure upstream node doesn't roll bac... Closed
depends on SERVER-27543 Create new metadata for oplog queries Closed
is duplicated by SERVER-19753 Do not allow force sync from stale so... Closed
is duplicated by SERVER-26253 Get new sync source rather than rollb... Closed
is duplicated by SERVER-27672 Unnecessary rollback for a hidden nod... Closed
is duplicated by SERVER-27980 Secondary tries to rollback when it l... Closed
related to SERVER-27149 Sync source selection doesn't conside... Closed
related to SERVER-26253 Get new sync source rather than rollb... Closed
related to SERVER-28377 Do not check that remote last applied... Closed
related to SERVER-28686 rollback_after_sync_source_selection.... Closed
related to SERVER-28430 Expose dropConnections() method on Co... Closed
is related to SERVER-28068 Do not go into rollback due to fallin... Open
is related to SERVER-27547 Adjust the system time . then MongoDB... Closed
is related to SERVER-28278 Wait for desired sync source to repli... Closed
is related to SERVER-27545 Include RBID in replSet metadata of c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4, v3.2
Epic Link: Improved support for Rollback for 3.4
Sprint: Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27
Linked BF Score: 0


When the document returned by the GTE query against our sync source does not include our most recent optime (ie the term and timestamp of our most recent oplog entry), we currently unconditionally return OplogStartMissing and go into ROLLBACK. The GTE query only includes the timestamp however, not the term, so we need to check the term and if the optime (including term) of our most recent oplog entry is higher than the optime we got back from the GTE query, we should not go into rollback but should just choose a new sync source.

Comment by Eric Milkie [ 13/Dec/16 ]

I would prefer, if possible, to construct the query such that data isn't returned if it is less than what we are looking for. As proposed, we would do a GTE query which would leave out some, but not all, of the documents we care about not seeing, and then after getting the data we would then do another filtering pass ourselves.

Comment by Spencer T Brody [ 13/Dec/16 ]

I believe we could do this with a change in the oplog query we do, as Eric Milkie suggests, by changing the query from

{ts: {$gte: op.timestamp}}


{$or: [{t: {$gt: op.t}}, {$and: [{t: op.t}, {ts: {$gte: op.ts}}]}]}

i.e. find the ops where either the term is greater than my last op, or the term is the same and the timestamp is greater.

If we take this approach, then implementing SERVER-26253 becomes mandatory as well or we will still trigger rollback in these situations.

My bigger concern with this approach is if it will negatively impact the performance of the oplog queries. David Storch, can you comment on what affect changing the query we use to fetch oplog entries for replication is likely to have? This is the main query we use for fetching oplog entries, which runs with tailable, oplogReplay, and awaitData all set to true.

Comment by Eric Milkie [ 13/Dec/16 ]

I suspect that changing the query this way may break oplogReplay, unfortunately. Our only other option is to do a local round of filtering after we get the data back.

Comment by Spencer T Brody [ 13/Dec/16 ]

I think what we should do is break out the case where we check if the oplog entry returned by our sync source matches our most recent oplog entries into 3 cases:

  1. If the sync source's entry is newer, return OplogStartMissing and go into rollback as we currently do.
  2. If our entry is newer, we return RemoteOplogStale, and then we update bgsync to not trigger rollback on RemoteOplogStale, and to just change sync sources instead (which is basically implementing SERVER-26253).
  3. If the two entries have the same OpTime but different hashes, fassert since that should be impossible.
Comment by Siyuan Zhou [ 19/Dec/16 ]

Should we compare our last optime against the last applied optime on our sync source rather than the first optime from the response of GLE?

Imagine the following scenario, timestamps are shown as integers for simplicity. When Node C queries the oplog on Node B to find anything greater than or equal to (ts: 3), it will get (ts: 3, term 1), whose term is less than its last term in oplog (ts: 3, term 2). But Node B is actually ahead of Node C.

Timestamp 1 2 3 4 5
Node A (term) 1 1 1 1 3
Node B (term) 1 1 1 1 3
Node C (term) 1 1 2    

Having the last applied optime in GLE metadata of response and checking it against its last fetched optime is a robust way to detect rollback.

Comment by Spencer T Brody [ 19/Dec/16 ]

The scenario you describe isn't possible. For node C to have been elected in term 2, a majority of nodes must have voted for it, and thus a majority of nodes must know about term 2, and all future committed writes must happen in a term greater than or equal to 2.

Comment by Siyuan Zhou [ 19/Dec/16 ]

That's where our implementation differs from Raft. Even if a majority of nodes know about term 2, they can still replicate logs from the old primary or nodes that synced from the old primary.

Even in Raft, I still think this is possible. At the beginning, Node C steps up in term 2.

Timestamp 1 2 3 4 5
Node A (term) 1 1 1 1  
Node B (term) 1 1      
Node C (term) 1 1 2    

Then Node A steps up again in term 3.

Timestamp 1 2 3 4 5
Node A (term) 1 1 1 1 3
Node B (term) 1 1      
Node C (term) 1 1 2    

Node A replicates and commits logs from old terms (term 1).

Timestamp 1 2 3 4 5
Node A (term) 1 1 1 1 3
Node B (term) 1 1 1 1 3
Node C (term) 1 1 2    
Comment by Spencer T Brody [ 20/Dec/16 ]

You're right, that is a problem.

For reference, the original problem this ticket is trying to avoid is node A or B rolling back in the following situation

Timestamp 1 2 3 4 5 6
Node A (term) 1 1 1 1 3  
Node B (term) 1 1 1 1 3  
Node C (term) 1 1 2 2 2 2
Comment by Judah Schvimer [ 29/Dec/16 ]

Per our discussion we will be doing the following:
The SyncSourceResolver will now make sure that our sync source is safe to use by both checking the Rollback ID (RBID) of the sync source candidate and by looking at its last applied OpTime. The SyncSourceResolver will now:
1. Choose a sync source as it currently does. The sync source selection algorithm will make sure that we only choose sync sources that are ahead of our last applied OpTime.
2. Get the sync source candidate's RBID and save it for later. Work on SERVER-27050 will do this, however it does it only if there is a minValid present, this ticket must make it an unconditional check.
3. Get sync source candidate's last applied OpTime and check to see it's greater than or equal to our last fetched OpTime. (This must be added by this ticket)
4. Get the sync source candidate's oldest oplog entry and see if we are even able to sync from this sync source.
5. Finally, Check if the sync source candidate contains minValid if one exists.

As part of SERVER-27050, the RBID will be checked again in BackgroundSync before the sync source is used. This check should be moved by this ticket into OplogFetcher::checkRemoteOplogStart() so that it is performed on the first batch only and before we decide to go into Rollback.

Finally, OplogFetcher::checkRemoteOplogStart() is now mostly fine. Since our sync source cannot have rolled back since we chose it, it appears impossible for our GTE query to return empty documents. We will add an fassert there instead of returning RemoteOplogStale. The OplogStartMissing cases should be fine now. If we fall off the back of the sync source's oplog between probing the sync source and getting our first batch, we will return an OplogStartMissing error here and go into rollback, leading to an inevitable UnrecoverableRollbackException since we will not be able to find a common point. That is a small window and users rarely fall of the back of their oplog. Further work could check if this is the case by re-querying the sync source's oldest oplog entry before going into rollback, but that is out of the scope of this ticket.

Comment by Siyuan Zhou [ 30/Dec/16 ]

Judah's proposal sounds correct given that we close connections on rollback (so RBID serves as a version). I think there might be an easier fix though. If GTE response starts from our last fetched optime, it means all the oplog entries up to our last fetched document are the same as those on our sync source, due to Log Matching property in Raft:

if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index.

In addition, we need to compare the last applied optime on sync source and make sure it's ahead of our last fetched optime. This requirement is the same in Raft. Actually the last applied optime on sync source is ready to use in ReplSetMetadata since we always query it in oplog fetcher.

For each GTE response, we make sure
1) (Oplog Matching) its first entry is the same as our last fetched, and
2) (Oplog Freshness) their last applied (ReplSetMetadata::getLastOpVisible()) is greater than our last fetched one.

If Oplog Matching is false but Oplog Freshness is true, we have to roll back.
If Oplog Freshness is false, choose another sync source.

As a result, the correctness is guaranteed by these two conditions, even if our sync source rolls back, and choosing sync source is only for performance. Maintaining RBID is not necessary in steady state replication then.

Comment by Siyuan Zhou [ 30/Dec/16 ]

As discussed with Judah Schvimer, the lastVisible is not the last applied optime. It’s the max of lastOpTimeOnClient and committed snapshot optime. The discussion of lastVisible is tracked in SERVER-27543. Changing and using lastVisible needs extra work due to multiversion support.

Comment by Mathias Stearn [ 30/Dec/16 ]

Siyuan Zhou, we don't actually close all connections on rollback. That is why we need to use the RBID (carefully!) to ensure that rollbacks haven't happened in critical sections.

Comment by Siyuan Zhou [ 06/Jan/17 ]

I think we close all connections on rollback. Closing connections (more importantly cursors) on rollback ensures that data fetched by consecutive query/getMore requests is continuous.

Changing the semantic of lastVisible may be a trouble for multiversion. It's easier to add the last applied optime to metadata for rollback detection. If the new field is absent, we can fall back to the current behavior for back-compatibility in multiversion replset. Once the user finishes upgrade, the new correct logic works on all nodes.

The benefit is that all rollback detection logic is at the same place.

Comment by Kelsey T Schubert [ 06/Jan/17 ]

Siyuan Zhou, my understanding is that we close all connections except those with the keepOpen property: https://github.com/mongodb/mongo/blob/ba55f25/src/mongo/db/repl/replication_coordinator_impl.cpp#L2590-L2600. Discussed on SERVER-26986.

Comment by Mathias Stearn [ 07/Jan/17 ]

Siyuan Zhou Once we establish a cursor to the oplog, we can rely on all getMores on that cursor being from the same timeline because a (succesfull) rollback will truncate the oplog which kills all cursors: https://github.com/mongodb/mongo/blob/ef1f1739d6cbff9fb4ddbcc77d467f183c0ab9f2/src/mongo/db/catalog/collection.cpp#L921

For the purposes of oplog reading we don't care if our upstream node unsuccessfully rolls back as long as it hasn't truncated the oplog.

Comment by Spencer T Brody [ 09/Jan/17 ]

I think the rbid check is necessary to ensure that our sync source continues to have our minvalid after we check for it. I do think we might be able to eliminate the proposed extra round trip in sync source resolver to get the last applied optime if we followed Siyuan's suggestion and included the lastOpApplied in the metadata and used that in the OplogFetcher::checkRemoteOplogStart() check.

Comment by Githook User [ 08/Mar/17 ]


{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-27403 Ensure sync source is ahead and has not rolled back after first OplogFetcher batch
Branch: master

Comment by Githook User [ 24/Mar/17 ]


{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-27403 SERVER-28278 Ensure sync source is ahead and not rolled back after first fetcher batch
Branch: v3.2

Comment by Githook User [ 06/Apr/17 ]


{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-27403 Ensure sync source is ahead and has not rolled back after first OplogFetcher batch

(cherry picked from commit c05f900dd80342d0899f6461f845dc97fe942b01)
Branch: v3.4

Generated at Sun Jan 20 15:38:22 UTC 2019 using Jira 7.12.1#712002-sha1:609a50578ba6bc73dbf8b05dddd7c04a04b6807c.