Core Server / SERVER-56004

Fix rollback_set_fcv.js where primary hangs when setting FCV after reconfig

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 5.0.0-rc2, 5.1.0-rc0
    • Affects Version/s: None
    • Component/s: Replication
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: Repl 2021-06-14

      We see the following sequence occur when running rollback_set_fcv.js:

      • We see a log message indicating that the downgraded replica set config has been propagated to a majority of nodes:
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  -        5042304 [conn2] "The downgraded replica set config has been propagated to a majority"
        
      • From the heartbeat requests, it also appears that all nodes have actually installed the new config (configVersion: 6, configTerm: 2):
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.003+00:00 D2 REPL_HB  24095   [conn20] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20772","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20772","fromId":2,"term":2,"primaryId":-1,"maxTimeMSOpOnly":10000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"7OtpxE//rC4QyFMzAfGz+yomYA4=","subType":"0"}},"keyId":6937200231259832000}},"$db":"admin"}}
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.140+00:00 D2 REPL_HB  24095   [conn5] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20770","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20770","fromId":0,"term":2,"maxTimeMSOpOnly":10000,"$replData":1,"$db":"admin"}}
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  REPL     4508701 [conn2] "The current replica set config is committed","attr":{"configVersion":6,"configTerm":2,"configWaitDurationMillis":2009,"oplogWaitDurationMillis":0,"configOplogCommitmentOpTime":{"ts":{"$timestamp":{"t":1615192804,"i":1}},"t":2}}
        
      • However, the primary receives heartbeat responses carrying the new config (configVersion: 6, configTerm: 2) from only one of the secondaries; the response from the node on port 20770 still reports configVersion: 5:
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:05.658+00:00 D2 REPL_HB  4615620 [ReplCoord-6] "Received response to heartbeat","attr":{"requestId":125,"target":"ip-10-122-51-241.ec2.internal:20770","response":{"ok":1,"state":2,"v":5,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:05.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:05.459Z"},"electable":true,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"configVersion":5,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192805,"i":1}}}}
        
        
        [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.491+00:00 D2 REPL_HB  4615620 [ReplCoord-2] "Received response to heartbeat","attr":{"requestId":127,"target":"ip-10-122-51-241.ec2.internal:20772","response":{"ok":1,"state":2,"v":6,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:01.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:01.459Z"},"electable":false,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"configVersion":6,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192801,"i":2}}}}
        
      • After this, the primary is network partitioned from the other nodes and attempts to set its FCV to the target version of 4.9.
      • To do this, it must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. The primary never updated its local view of one of the secondaries (the node on port 20770), so it still believes that node has the old config. Because the primary is partitioned, it can never refresh its view of the other nodes' configs, so the wait in awaitReplication hangs and the FCV is never set to the target version of 4.9 (see the sketch after this list).
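
      The hang can be pictured with the following minimal sketch. This is illustrative, not the actual rollback_set_fcv.js code: it assumes the standard jstests ReplSetTest fixture with useBridge: true so the primary can be partitioned, and the real test coordinates the timing of these steps with failpoints rather than racing a parallel shell.

      (function() {
      "use strict";

      const rst = new ReplSetTest({nodes: 3, useBridge: true});
      rst.startSet();
      rst.initiate();

      const primary = rst.getPrimary();
      const [sec0, sec1] = rst.getSecondaries();

      // Kick off the FCV downgrade in a parallel shell. Internally it performs
      // a reconfig, and that reconfig is treated as committed once a MAJORITY
      // of nodes have installed it, possibly before the primary has heard, via
      // heartbeat responses, that every secondary installed it.
      const awaitSetFCV = startParallelShell(
          "db.adminCommand({setFeatureCompatibilityVersion: '4.9'});",
          primary.port);

      // Partition the primary from both secondaries while its view of one
      // secondary's config is still stale (configVersion: 5 instead of 6).
      primary.disconnect(sec0);
      primary.disconnect(sec1);

      // The downgrade then waits for the config to propagate to ALL nodes. The
      // primary's stale view can never be refreshed across the partition, so
      // this join never returns and the FCV never reaches 4.9.
      awaitSetFCV();
      })();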

      One possible solution is to ensure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just a majority. That way, the primary's view of the other nodes' configs would be up to date before it tries to set its FCV; a sketch of such a wait follows.
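
      A hedged sketch of that approach, in jstests style: waitForConfigOnAllNodes is a hypothetical helper name, and it polls replSetGetStatus, whose per-member entries include the configVersion and configTerm the primary has learned from heartbeats.

      function waitForConfigOnAllNodes(primary, expectedVersion, expectedTerm) {
          assert.soon(() => {
              const status =
                  assert.commandWorked(primary.adminCommand({replSetGetStatus: 1}));
              // Each member entry reflects the primary's heartbeat-derived view,
              // which is exactly the view that went stale in this bug.
              return status.members.every(
                  (m) => m.configVersion === expectedVersion &&
                         m.configTerm === expectedTerm);
          }, "timed out waiting for the config to replicate to all nodes");
      }

      // In this test, the wait would run before the partition step, e.g.:
      // waitForConfigOnAllNodes(primary, 6, 2);

      Because replSetGetStatus reports the primary's heartbeat-derived view of each member, passing this check before the partition guarantees the primary already knows every node has configVersion: 6, configTerm: 2, which is exactly the condition the later wait in awaitReplication depends on.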

            Assignee:
            Huayu Ouyang (huayu.ouyang@mongodb.com)
            Reporter:
            Huayu Ouyang (huayu.ouyang@mongodb.com)
            Votes:
            0
            Watchers:
            6
