Uploaded image for project: 'Documentation'
  1. Documentation
  2. DOCS-14370

Investigate changes in SERVER-56004: Fix bug where primary hangs when setting FCV after reconfig

      Description

      Downstream Change Summary

      We should document that users should not attempt an upgrade or a downgrade if any member of the replica set is down.

      Description of Linked Ticket

      We see the following sequence occur when running rollback_set_fcv.js:

      • We see a log message indicating that it has been propagated to a majority of nodes:
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  -        5042304 [conn2] "The downgraded replica set config has been propagated to a majority"
        
      • It also seems like from heartbeat requests, all nodes have actually updated to this new config of configVersion: 6, configTerm: 2.
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.003+00:00 D2 REPL_HB  24095   [conn20] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20772","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20772","fromId":2,"term":2,"primaryId":-1,"maxTimeMSOpOnly":10000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"7OtpxE//rC4QyFMzAfGz+yomYA4=","subType":"0"}},"keyId":6937200231259832000}},"$db":"admin"}}
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.140+00:00 D2 REPL_HB  24095   [conn5] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20770","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20770","fromId":0,"term":2,"maxTimeMSOpOnly":10000,"$replData":1,"$db":"admin"}}
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  REPL     4508701 [conn2] "The current replica set config is committed","attr":{"configVersion":6,"configTerm":2,"configWaitDurationMillis":2009,"oplogWaitDurationMillis":0,"configOplogCommitmentOpTime":{"ts":{"$timestamp":{"t":1615192804,"i":1}},"t":2}}
        
      • However, on the primary, we only receive heartbeat responses with the new config of configVersion: 6, configTerm: 2 from one of the secondaries:
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:05.658+00:00 D2 REPL_HB  4615620 [ReplCoord-6] "Received response to heartbeat","attr":{"requestId":125,"target":"ip-10-122-51-241.ec2.internal:20770","response":{"ok":1,"state":2,"v":5,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:05.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:05.459Z"},"electable":true,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"configVersion":5,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192805,"i":1}}}}
        
        
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.491+00:00 D2 REPL_HB  4615620 [ReplCoord-2] "Received response to heartbeat","attr":{"requestId":127,"target":"ip-10-122-51-241.ec2.internal:20772","response":{"ok":1,"state":2,"v":6,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:01.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:01.459Z"},"electable":false,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"configVersion":6,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192801,"i":2}}}}
        
      • After this, the primary is network partitioned from the other nodes, and attempts to set its FCV to the target version of 4.9.
      • In order to do this, it must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. It seems like the primary never updated its local view of one of the secondaries (node with port 20770), so it still thinks it has an old config. Since the primary is partitioned, it is never able to update its view of the other nodes' configs, which causes the hang in awaitReplication, so it is never able to set its FCV to the target version of 4.9.

      One possible solution could be to make sure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just the majority. This way, the primary's view of the other nodes' configs would be up to date before trying to set its FCV.

      Scope of changes

      Impact to Other Docs

      MVP (Work and Date)

      Resources (Scope or Design Docs, Invision, etc.)

            Assignee:
            jason.price@mongodb.com Jason Price
            Reporter:
            backlog-server-pm Backlog - Core Eng Program Management Team
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

              Created:
              Updated:
              Resolved:
              2 years, 17 weeks, 6 days ago