[SERVER-27403] Consider term and rbid when validating the proposed sync source Created: 13/Dec/16 Updated: 29/Aug/18 Resolved: 08/Mar/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.2.13, 3.4.4, 3.5.5 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Judah Schvimer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.4, v3.2 |
| Sprint: | Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 19 |
| Description |

When the document returned by the GTE query against our sync source does not include our most recent optime (i.e. the term and timestamp of our most recent oplog entry), we currently unconditionally return OplogStartMissing and go into ROLLBACK. However, the GTE query only matches on the timestamp, not the term. We need to check the term as well: if the optime (including term) of our most recent oplog entry is higher than the optime we got back from the GTE query, we should not go into rollback but should instead choose a new sync source.
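
A minimal sketch of this decision, modeling an optime as a (term, timestamp) pair so that tuple comparison gives the term precedence (the function name and return values are illustrative, not the server's C++ code):

```python
# Illustrative sketch of the proposed decision; not the server's C++ code.
# An optime is modeled as a (term, timestamp) pair, so tuple comparison
# gives the term precedence over the timestamp.

def on_gte_query_result(my_last_optime, remote_first_optime):
    if remote_first_optime == my_last_optime:
        return "continue"  # the sync source contains our last entry
    if my_last_optime > remote_first_optime:
        # Our entry is newer (e.g. same timestamp, higher term): the sync
        # source is stale, so choose a new one instead of rolling back.
        return "choose_new_sync_source"
    # The sync source has a newer or diverged history: roll back.
    return "rollback"

# With our last entry at (term 2, ts 3) and the sync source returning
# (term 1, ts 3), we should switch sync sources rather than roll back.
assert on_gte_query_result((2, 3), (1, 3)) == "choose_new_sync_source"
```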
| Comments |
| Comment by Githook User [ 06/Apr/17 ] |

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message: (cherry picked from commit c05f900dd80342d0899f6461f845dc97fe942b01)
| Comment by Githook User [ 24/Mar/17 ] |

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message:
| Comment by Githook User [ 08/Mar/17 ] |

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>
Message:
| Comment by Spencer Brody (Inactive) [ 09/Jan/17 ] |

I think the rbid check is necessary to ensure that our sync source continues to have our minvalid after we check for it. I do think we might be able to eliminate the proposed extra round trip in the sync source resolver to get the last applied optime, if we followed Siyuan's suggestion and included the lastOpApplied in the metadata and used that in the OplogFetcher::checkRemoteOplogStart() check.
| Comment by Mathias Stearn [ 07/Jan/17 ] |

siyuan.zhou Once we establish a cursor on the oplog, we can rely on all getMores on that cursor being from the same timeline, because a (successful) rollback will truncate the oplog, which kills all cursors: https://github.com/mongodb/mongo/blob/ef1f1739d6cbff9fb4ddbcc77d467f183c0ab9f2/src/mongo/db/catalog/collection.cpp#L921. For the purposes of oplog reading, we don't care if our upstream node unsuccessfully rolls back, as long as it hasn't truncated the oplog.
| Comment by Kelsey Schubert [ 06/Jan/17 ] |

siyuan.zhou, my understanding is that we close all connections except those with the keepOpen property: https://github.com/mongodb/mongo/blob/ba55f25/src/mongo/db/repl/replication_coordinator_impl.cpp#L2590-L2600. Discussed on
| Comment by Siyuan Zhou [ 06/Jan/17 ] |

I think we close all connections on rollback. Closing connections (more importantly, cursors) on rollback ensures that data fetched by consecutive query/getMore requests is continuous. Changing the semantics of lastVisible may be troublesome for multiversion; it's easier to add the last applied optime to the metadata for rollback detection. If the new field is absent, we can fall back to the current behavior, preserving backward compatibility in a mixed-version replica set. Once the user finishes upgrading, the new, correct logic works on all nodes. The benefit is that all rollback detection logic lives in one place.
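
A sketch of this fallback, assuming a hypothetical lastOpApplied field in the response metadata (field and function names are illustrative):

```python
# Illustrative sketch: prefer a new lastOpApplied metadata field when the
# sync source sends it, and fall back to today's timestamp-only check when
# it is absent (an old-version node), so mixed-version replica sets keep
# working during an upgrade. Optimes are (term, timestamp) pairs.

def detect_rollback(metadata, my_last_fetched, remote_first_returned):
    last_applied = metadata.get("lastOpApplied")  # absent on old versions
    if last_applied is None:
        # Fallback: roughly the current behavior, which compares only the
        # timestamp of the first returned entry against ours.
        return remote_first_returned[1] != my_last_fetched[1]
    # New behavior: roll back only if the sync source has diverged from us
    # and is at least as fresh as we are.
    diverged = remote_first_returned != my_last_fetched
    fresher = tuple(last_applied) >= my_last_fetched
    return diverged and fresher

# Example: an old-version sync source sends no lastOpApplied field.
assert detect_rollback({}, (2, 3), (2, 3)) is False
```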
| Comment by Mathias Stearn [ 30/Dec/16 ] |

siyuan.zhou, we don't actually close all connections on rollback. That is why we need to use the RBID (carefully!) to ensure that rollbacks haven't happened in critical sections.
| Comment by Siyuan Zhou [ 30/Dec/16 ] |

As discussed with judah.schvimer, lastVisible is not the last applied optime; it's the max of lastOpTimeOnClient and the committed snapshot optime. The discussion of lastVisible is tracked in
| Comment by Siyuan Zhou [ 30/Dec/16 ] |

Judah's proposal sounds correct, given that we close connections on rollback (so the RBID serves as a version). I think there might be an easier fix, though. If the GTE response starts from our last fetched optime, then all the oplog entries up to our last fetched document are the same as those on our sync source, by the Log Matching property in Raft: if two logs contain an entry with the same index and term, then the logs are identical in all entries up through the given index. In addition, we need to compare against the last applied optime on the sync source and make sure it's ahead of our last fetched optime; this requirement is the same as in Raft. The last applied optime on the sync source is in fact already available in ReplSetMetadata, since we always request it in the oplog fetcher. For each GTE response, we verify both conditions: Oplog Matching (the response starts from our last fetched optime) and Oplog Freshness (the sync source's last applied optime is no older than our last fetched optime). If Oplog Matching is false but Oplog Freshness is true, we have to roll back. As a result, correctness is guaranteed by these two conditions even if our sync source rolls back, and choosing a sync source becomes purely a performance decision. Maintaining the RBID is then not necessary in steady-state replication.
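
A sketch of the two conditions as predicates, with optimes as (term, timestamp) pairs (names follow this comment; the code is illustrative, not the server's):

```python
# Sketch of the two per-response conditions, with optimes as
# (term, timestamp) pairs. The sync source's last applied optime comes
# from the ReplSetMetadata already returned with each oplog query.

def oplog_matching(my_last_fetched, first_returned):
    # Raft Log Matching: an entry with the same term and timestamp implies
    # the logs are identical up through that entry.
    return first_returned == my_last_fetched

def oplog_freshness(my_last_fetched, source_last_applied):
    # The sync source must be no older than our last fetched optime.
    return source_last_applied >= my_last_fetched

def must_roll_back(my_last_fetched, first_returned, source_last_applied):
    return (not oplog_matching(my_last_fetched, first_returned)
            and oplog_freshness(my_last_fetched, source_last_applied))
```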
| Comment by Judah Schvimer [ 29/Dec/16 ] |

Per our discussion we will be doing the following. Finally, OplogFetcher::checkRemoteOplogStart() is now mostly fine. Since our sync source cannot have rolled back since we chose it, it appears impossible for our GTE query to return no documents, so we will add an fassert there instead of returning RemoteOplogStale. The OplogStartMissing cases should be fine now: if we fall off the back of the sync source's oplog between probing the sync source and getting our first batch, we will return an OplogStartMissing error and go into rollback, leading to an inevitable UnrecoverableRollbackException since we will not be able to find a common point. That is a small window, and users rarely fall off the back of their oplog. Further work could detect this case by re-querying the sync source's oldest oplog entry before going into rollback, but that is out of scope for this ticket.
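
A simplified sketch of the resulting checkRemoteOplogStart() behavior (the real implementation is C++ in the server; names and control flow here are illustrative):

```python
# Simplified sketch of the planned checkRemoteOplogStart() behavior.
# The real function is C++ in the server; names here are illustrative.
# Optimes are (term, timestamp) pairs.

def check_remote_oplog_start(first_batch, my_last_fetched):
    if not first_batch:
        # The sync source cannot have rolled back since we chose it, so an
        # empty first batch should be impossible: fassert instead of
        # returning RemoteOplogStale.
        raise AssertionError("fassert: empty first batch from sync source")
    if first_batch[0] != my_last_fetched:
        # Includes the narrow window where we fell off the back of the
        # sync source's oplog after probing it; rollback will then fail to
        # find a common point (UnrecoverableRollbackException).
        return "OplogStartMissing"
    return "OK"
```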
| Comment by Spencer Brody (Inactive) [ 20/Dec/16 ] |

You're right, that is a problem. For reference, the original problem this ticket is trying to avoid is node A or B rolling back in the following situation.
| Comment by Siyuan Zhou [ 19/Dec/16 ] |

That's where our implementation differs from Raft. Even if a majority of nodes know about term 2, they can still replicate logs from the old primary or from nodes that synced from the old primary. Even in Raft, I still think this is possible:

1. At the beginning, Node C steps up in term 2.
2. Then Node A steps up again in term 3.
3. Node A replicates and commits logs from old terms (term 1).
| Comment by Spencer Brody (Inactive) [ 19/Dec/16 ] |

The scenario you describe isn't possible. For node C to have been elected in term 2, a majority of nodes must have voted for it, and thus a majority of nodes must know about term 2, and all future committed writes must happen in a term greater than or equal to 2.
| Comment by Siyuan Zhou [ 19/Dec/16 ] |

Should we compare our last optime against the last applied optime on our sync source, rather than against the first optime in the response of the GTE query? Imagine the following scenario (timestamps are shown as integers for simplicity). When Node C queries the oplog on Node B to find anything greater than or equal to (ts: 3), it will get (ts: 3, term 1), whose term is less than the last term in its own oplog (ts: 3, term 2). But Node B is actually ahead of Node C.

Having the last applied optime in the metadata of the GTE response and checking it against the last fetched optime is a robust way to detect rollback.
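
A worked version of this scenario as a sketch; Node B's oplog entries beyond those named in the comment are assumptions for illustration:

```python
# Worked version of the scenario above, with optimes as (term, timestamp)
# pairs. Node B's entries beyond those named in the comment are assumed.
node_c_oplog = [(1, 1), (1, 2), (2, 3)]          # (2, 3) was never committed
node_b_oplog = [(1, 1), (1, 2), (1, 3), (3, 4)]  # ahead of Node C

# Node C's query uses only the timestamp of its last entry (ts 3), so
# Node B answers starting at (term 1, ts 3), a lower term than Node C's.
first_returned = next(op for op in node_b_oplog if op[1] >= 3)
assert first_returned == (1, 3)
assert first_returned < node_c_oplog[-1]  # term-only view: B looks stale

# But Node B's last applied optime, carried in the response metadata,
# shows it is actually ahead, so Node C is the one that must roll back.
assert node_b_oplog[-1] > node_c_oplog[-1]
```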
| Comment by Spencer Brody (Inactive) [ 13/Dec/16 ] |

I think what we should do is break the check of whether the oplog entry returned by our sync source matches our most recent oplog entry into 3 cases:
| Comment by Eric Milkie [ 13/Dec/16 ] |

I suspect that changing the query this way may break oplogReplay, unfortunately. Our only other option is to do a local round of filtering after we get the data back.
| Comment by Spencer Brody (Inactive) [ 13/Dec/16 ] |

I believe we could do this with a change to the oplog query we issue, as milkie suggests, by changing the timestamp-only GTE query into one that also matches on term: i.e., find the ops where either the term is greater than that of my last op, or the term is the same and the timestamp is greater. If we take this approach, my bigger concern is whether it will negatively impact the performance of the oplog queries. david.storch, can you comment on what effect changing the query we use to fetch oplog entries for replication is likely to have? This is the main query we use for fetching oplog entries, which runs with tailable, oplogReplay, and awaitData all set to true.
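
Based on this description, the two query shapes would look roughly like the following (illustrative filter documents; t is the term field and ts the timestamp field):

```python
# Rough reconstruction of the two query shapes from the description above
# (illustrative filter documents; t is the term and ts the timestamp).

last_term, last_ts = 2, 3  # example values for our last oplog entry

# Current query: timestamp-only GTE, which ignores the term entirely.
current_query = {"ts": {"$gte": last_ts}}

# Proposed query: ops from a later term, or from our term at or past our
# timestamp ($gte keeps our own last entry in the result so the oplogs
# can still be verified to match).
proposed_query = {"$or": [
    {"t": {"$gt": last_term}},
    {"t": last_term, "ts": {"$gte": last_ts}},
]}
```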
| Comment by Eric Milkie [ 13/Dec/16 ] |

I would prefer, if possible, to construct the query such that data isn't returned if it is less than what we are looking for. As proposed, we would do a GTE query, which would leave out some, but not all, of the documents we care about not seeing, and then after getting the data back we would do another filtering pass ourselves.
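
A sketch of that local filtering pass, comparing (term, timestamp) pairs (illustrative, not the server's code):

```python
# Sketch of the local filtering pass: after a timestamp-only GTE query,
# drop returned entries that are strictly older than our last optime when
# compared as (term, timestamp). Illustrative, not the server's code.

def filter_batch(batch, last_term, last_ts):
    return [doc for doc in batch
            if (doc["t"], doc["ts"]) >= (last_term, last_ts)]

# With our last optime at (term 2, ts 3), the entry (t 1, ts 3) that a
# GTE-on-timestamp query returns is filtered out locally.
batch = [{"t": 1, "ts": 3}, {"t": 3, "ts": 4}]
assert filter_batch(batch, 2, 3) == [{"t": 3, "ts": 4}]
```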