Core Server / SERVER-27980

Secondary tries to roll back when it has lagged too far behind the primary

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.4.1
    • Fix Version/s: None
    • Component/s: Replication
    • Labels:
      None
    • Operating System:
      ALL
    • Steps To Reproduce:

      To reproduce this situation I used a modified version of the zbigMapReduce.js test, in which I changed the specified oplog size from 50MB to 2MB. I then ran the test as follows:

      python buildscripts/resmoke.py --dbpathPrefix=~/ramd --executor=sharding --jobs=4 --storageEngine=wiredTiger jstests/sharding/zbigMapReduce.js
      

      where ~/ramd is a 4GB tmpfs partition.


      Description

      In a replication scenario with a lot of inserts on the primary node, the secondary is not always able to replicate all inserts in time. The documentation on the RECOVERING state clearly says that in this situation the secondary should transition to the RECOVERING state and that manual intervention is required:

      Due to overload, a secondary may fall far enough behind the other members of the replica set such that it may need to resync with the rest of the set. When this happens, the member enters the RECOVERING state and requires manual intervention.

      But in reality the secondary node sometimes tries to roll back in this situation, leading to a failed rollback and an incorrect node state.

      I believe the problem is in the BackgroundSync::_produce function in src/mongo/db/repl/bgsync.cpp. When the OplogStartMissing status code is returned by _syncSourceResolver, this function correctly transitions the node to the RECOVERING state. But when the same status code is returned by oplogFetcher, the code executes a rollback without going to the RECOVERING state. I think the oplog fetcher can return OplogStartMissing either when a rollback is necessary or when the secondary has fallen far behind the primary. So there should be an additional check to determine whether a rollback is necessary or the secondary should go to the RECOVERING state.
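
      The additional check suggested above could look roughly like the following. This is only a simplified model in Python, not the code in bgsync.cpp: the names OplogWindow, Action and on_oplog_start_missing are hypothetical, and timestamps are reduced to plain integers. It assumes the secondary can learn the oldest oplog entry still held by its sync source.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ROLLBACK = "roll back to a common point"
    RECOVERING = "enter RECOVERING (too stale, needs resync)"


@dataclass(frozen=True)
class OplogWindow:
    """Oldest and newest oplog timestamps held by a node (hypothetical model)."""
    oldest: int
    newest: int


def on_oplog_start_missing(local: OplogWindow, source: OplogWindow) -> Action:
    """Decide how to react when the fetcher reports OplogStartMissing."""
    # Every entry the source still holds is newer than our newest entry,
    # so the two oplogs cannot share a common point and a rollback
    # would fail. The node is simply too stale.
    if source.oldest > local.newest:
        return Action.RECOVERING
    # Our newest entry lies inside the source's window but was not found
    # there, which suggests the histories diverged.
    return Action.ROLLBACK
```

      For example, a secondary whose newest entry is at 100 while the source's oldest entry is at 150 would go to RECOVERING, whereas one whose newest entry lies inside the source's window would roll back.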
