  1. Core Server
  2. SERVER-27403

Consider term and rbid when validating the proposed sync source

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.13, 3.5.5
    • Component/s: Replication
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v3.4, v3.2
    • Sprint:
      Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27

      Description

      When the document returned by the GTE query against our sync source does not include our most recent optime (i.e., the term and timestamp of our most recent oplog entry), we currently return OplogStartMissing unconditionally and go into ROLLBACK. However, the GTE query filters only on the timestamp, not the term, so we also need to check the term: if the optime (including term) of our most recent oplog entry is higher than the optime returned by the GTE query, we should not go into rollback but should instead choose a new sync source.
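
The rule described above could be sketched roughly as follows. This is an illustrative sketch of the description only, not the committed fix: `OpTime`, `Action`, and `validate_first_batch` are hypothetical names, and the sketch assumes optimes are ordered by term first and then by timestamp.

```python
from dataclasses import dataclass
from enum import Enum


@dataclass(frozen=True, order=True)
class OpTime:
    # Hypothetical stand-in for an optime. Field order matters here:
    # order=True compares term first, then timestamp (an assumption).
    term: int
    timestamp: int


class Action(Enum):
    CONTINUE = "continue fetching"
    ROLLBACK = "go into rollback"
    CHOOSE_NEW_SYNC_SOURCE = "choose a new sync source"


def validate_first_batch(our_last: OpTime, remote_first: OpTime) -> Action:
    """Decide what to do with the first document returned by the GTE query.

    our_last:     optime of our most recent oplog entry.
    remote_first: optime of the first document the sync source returned.
    """
    if remote_first == our_last:
        # The sync source has our most recent entry; nothing to fix.
        return Action.CONTINUE
    if our_last > remote_first:
        # The query matched on timestamp alone, but once the term is taken
        # into account we are ahead of what came back: pick another source.
        return Action.CHOOSE_NEW_SYNC_SOURCE
    # Otherwise the histories diverge and the remote is ahead of us.
    return Action.ROLLBACK
```

For example, with our last entry at term 3 and the returned document at term 2 with a later timestamp, the sketch picks a new sync source rather than rolling back.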

          Activity

          thomas.schubert Thomas Schubert added a comment -

          Siyuan Zhou, my understanding is that we close all connections except those with the keepOpen property: https://github.com/mongodb/mongo/blob/ba55f25/src/mongo/db/repl/replication_coordinator_impl.cpp#L2590-L2600. Discussed on SERVER-26986.

          redbeard0531 Mathias Stearn added a comment -

          Siyuan Zhou, once we establish a cursor on the oplog, we can rely on all getMores on that cursor coming from the same timeline, because a (successful) rollback truncates the oplog, which kills all cursors: https://github.com/mongodb/mongo/blob/ef1f1739d6cbff9fb4ddbcc77d467f183c0ab9f2/src/mongo/db/catalog/collection.cpp#L921

          For the purposes of oplog reading we don't care if our upstream node unsuccessfully rolls back as long as it hasn't truncated the oplog.

          spencer Spencer T Brody added a comment -

          I think the rbid check is necessary to ensure that our sync source continues to have our minvalid after we check for it. I do think we might be able to eliminate the proposed extra round trip in sync source resolver to get the last applied optime if we followed Siyuan's suggestion and included the lastOpApplied in the metadata and used that in the OplogFetcher::checkRemoteOplogStart() check.
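
A minimal sketch of the combined check discussed in this comment, under stated assumptions: the first batch's metadata is assumed to carry both the sync source's rollback ID and its lastOpApplied, and "ahead" is assumed to mean strictly greater. `FirstBatchMetadata` and `sync_source_still_valid` are hypothetical names, not the server's API.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical optime representation: (term, timestamp), compared term first.
OpTime = Tuple[int, int]


@dataclass(frozen=True)
class FirstBatchMetadata:
    # Assumed fields returned alongside the first batch (illustrative only).
    rbid: int                 # sync source's rollback ID at response time
    last_op_applied: OpTime   # sync source's last applied optime


def sync_source_still_valid(
    rbid_at_selection: int,
    our_last_applied: OpTime,
    metadata: FirstBatchMetadata,
) -> bool:
    """Return True if the sync source can keep being used.

    rbid_at_selection is the rollback ID recorded when the sync source
    was chosen; a different value now means it rolled back since then.
    """
    if metadata.rbid != rbid_at_selection:
        # The source rolled back after we validated it.
        return False
    # Assumption: the source must be strictly ahead of us to be useful.
    return metadata.last_op_applied > our_last_applied
```

Using the metadata from the first batch in this way would avoid a separate round trip to fetch the last applied optime, which is the saving the comment suggests.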

          xgen-internal-githook Githook User added a comment -

          Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

          Message: SERVER-27403 Ensure sync source is ahead and has not rolled back after first OplogFetcher batch
          Branch: master
          https://github.com/mongodb/mongo/commit/c05f900dd80342d0899f6461f845dc97fe942b01

          xgen-internal-githook Githook User added a comment -

          Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

          Message: SERVER-27403 SERVER-28278 Ensure sync source is ahead and not rolled back after first fetcher batch
          Branch: v3.2
          https://github.com/mongodb/mongo/commit/45cc6d20a413d88fc49f6dac257f800fda926be6


            People

            • Votes: 0
            • Watchers: 15
