[SERVER-56004] Fix rollback_set_fcv.js where primary hangs when setting FCV after reconfig Created: 09/Apr/21  Updated: 29/Oct/23  Resolved: 10/Jun/21

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 5.0.0-rc2, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Huayu Ouyang Assignee: Huayu Ouyang
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
is documented by DOCS-14370 Investigate changes in SERVER-56004: ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2021-06-14
Participants:
Linked BF Score: 18

 Description   

We see the following sequence occur when running rollback_set_fcv.js:

  • We see a log message indicating that it has been propagated to a majority of nodes:

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  -        5042304 [conn2] "The downgraded replica set config has been propagated to a majority"
    

  • It also seems like from heartbeat requests, all nodes have actually updated to this new config of configVersion: 6, configTerm: 2.

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.003+00:00 D2 REPL_HB  24095   [conn20] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20772","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20772","fromId":2,"term":2,"primaryId":-1,"maxTimeMSOpOnly":10000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"7OtpxE//rC4QyFMzAfGz+yomYA4=","subType":"0"}},"keyId":6937200231259832000}},"$db":"admin"}}
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.140+00:00 D2 REPL_HB  24095   [conn5] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20770","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20770","fromId":0,"term":2,"maxTimeMSOpOnly":10000,"$replData":1,"$db":"admin"}}
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  REPL     4508701 [conn2] "The current replica set config is committed","attr":{"configVersion":6,"configTerm":2,"configWaitDurationMillis":2009,"oplogWaitDurationMillis":0,"configOplogCommitmentOpTime":{"ts":{"$timestamp":{"t":1615192804,"i":1}},"t":2}}
    

  • However, on the primary, we only receive heartbeat responses with the new config of configVersion: 6, configTerm: 2 from one of the secondaries:

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:05.658+00:00 D2 REPL_HB  4615620 [ReplCoord-6] "Received response to heartbeat","attr":{"requestId":125,"target":"ip-10-122-51-241.ec2.internal:20770","response":{"ok":1,"state":2,"v":5,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:05.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:05.459Z"},"electable":true,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"configVersion":5,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192805,"i":1}}}}
     
     
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.491+00:00 D2 REPL_HB  4615620 [ReplCoord-2] "Received response to heartbeat","attr":{"requestId":127,"target":"ip-10-122-51-241.ec2.internal:20772","response":{"ok":1,"state":2,"v":6,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:01.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:01.459Z"},"electable":false,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"configVersion":6,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192801,"i":2}}}}
    

  • After this, the primary is network partitioned from the other nodes, and attempts to set its FCV to the target version of 4.9.
  • In order to do this, it must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. It seems like the primary never updated its local view of one of the secondaries (node with port 20770), so it still thinks it has an old config. Since the primary is partitioned, it is never able to update its view of the other nodes' configs, which causes the hang in awaitReplication, so it is never able to set its FCV to the target version of 4.9.

One possible solution could be to make sure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just the majority. This way, the primary's view of the other nodes' configs would be up to date before trying to set its FCV.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 10/Jun/21 ]

Author:

{'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com', 'username': 'huayu-ouyang'}

Message: SERVER-56004 Fix rollback_set_fcv.js where primary hangs when setting FCV after reconfig

(cherry picked from commit 7f56da83d8ce91fa17bd5687a33747fe34e64cf8)
Branch: v5.0
https://github.com/mongodb/mongo/commit/24f30176f35df97b2d2ca569a946287ff4b2938f

Comment by Githook User [ 10/Jun/21 ]

Author:

{'name': 'Huayu Ouyang', 'email': 'huayu.ouyang@mongodb.com', 'username': 'huayu-ouyang'}

Message: SERVER-56004 Fix rollback_set_fcv.js where primary hangs when setting FCV after reconfig
Branch: master
https://github.com/mongodb/mongo/commit/7f56da83d8ce91fa17bd5687a33747fe34e64cf8

Generated at Thu Feb 08 05:38:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.