[DOCS-14370] Investigate changes in SERVER-56004: Fix bug where primary hangs when setting FCV after reconfig Created: 19/Apr/21  Updated: 13/Nov/23  Resolved: 12/Oct/21

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 5.0.0-rc2, 5.1.0-rc0, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team Assignee: Jason Price
Resolution: Done Votes: 0
Labels: None
Σ Remaining Estimate: Not Specified Remaining Estimate: Not Specified
Σ Time Spent: Not Specified Time Spent: Not Specified
Σ Original Estimate: Not Specified Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-56004 Fix rollback_set_fcv.js where primary... Closed
Sub-Tasks:
Key
Summary
Type
Status
Assignee
DOCS-14865 BACKPORT to 4.4 Sub-task Closed Jason Price  
DOCS-14866 BACKPORT to 4.2 Sub-task Closed Jason Price  
Participants:
Days since reply: 2 years, 15 weeks, 2 days ago
Epic Link: DOCSP-14300
Story Points: 3

 Description   

Description

Downstream Change Summary

We should document that users should not attempt an upgrade or a downgrade if any member of the replica set is down.

Description of Linked Ticket

We see the following sequence occur when running rollback_set_fcv.js:

  • We see a log message indicating that it has been propagated to a majority of nodes:

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  -        5042304 [conn2] "The downgraded replica set config has been propagated to a majority"
    

  • It also seems like from heartbeat requests, all nodes have actually updated to this new config of configVersion: 6, configTerm: 2.

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.003+00:00 D2 REPL_HB  24095   [conn20] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20772","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20772","fromId":2,"term":2,"primaryId":-1,"maxTimeMSOpOnly":10000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"7OtpxE//rC4QyFMzAfGz+yomYA4=","subType":"0"}},"keyId":6937200231259832000}},"$db":"admin"}}
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.140+00:00 D2 REPL_HB  24095   [conn5] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20770","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20770","fromId":0,"term":2,"maxTimeMSOpOnly":10000,"$replData":1,"$db":"admin"}}
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  REPL     4508701 [conn2] "The current replica set config is committed","attr":{"configVersion":6,"configTerm":2,"configWaitDurationMillis":2009,"oplogWaitDurationMillis":0,"configOplogCommitmentOpTime":{"ts":{"$timestamp":{"t":1615192804,"i":1}},"t":2}}
    

  • However, on the primary, we only receive heartbeat responses with the new config of configVersion: 6, configTerm: 2 from one of the secondaries:

    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:05.658+00:00 D2 REPL_HB  4615620 [ReplCoord-6] "Received response to heartbeat","attr":{"requestId":125,"target":"ip-10-122-51-241.ec2.internal:20770","response":{"ok":1,"state":2,"v":5,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:05.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:05.459Z"},"electable":true,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"configVersion":5,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192805,"i":1}}}}
     
     
    [js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.491+00:00 D2 REPL_HB  4615620 [ReplCoord-2] "Received response to heartbeat","attr":{"requestId":127,"target":"ip-10-122-51-241.ec2.internal:20772","response":{"ok":1,"state":2,"v":6,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:01.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:01.459Z"},"electable":false,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"configVersion":6,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192801,"i":2}}}}
    

  • After this, the primary is network partitioned from the other nodes, and attempts to set its FCV to the target version of 4.9.
  • In order to do this, it must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. It seems like the primary never updated its local view of one of the secondaries (node with port 20770), so it still thinks it has an old config. Since the primary is partitioned, it is never able to update its view of the other nodes' configs, which causes the hang in awaitReplication, so it is never able to set its FCV to the target version of 4.9.

One possible solution could be to make sure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just the majority. This way, the primary's view of the other nodes' configs would be up to date before trying to set its FCV.

Scope of changes

Impact to Other Docs

MVP (Work and Date)

Resources (Scope or Design Docs, Invision, etc.)



 Comments   
Comment by Githook User [ 25/Oct/21 ]

Author:

{'name': 'Jeff Allen', 'email': 'jeffrey.allen@10gen.com', 'username': 'jeff-allen-mongo'}

Message: Revert "DOCS-14370 primary hangs when setting FCV"

This reverts commit 16e1d977be2d8022f2b54412194abcc343f1f499.
Branch: v4.2
https://github.com/mongodb/docs/commit/f4cd4e09aca930a29b4bf2fcfb243c31621e3197

Comment by Githook User [ 12/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-14370 primary hangs when setting FCV

DOCS-14370 primary hangs when setting FCV
Branch: v4.2
https://github.com/mongodb/docs/commit/16e1d977be2d8022f2b54412194abcc343f1f499

Comment by Githook User [ 12/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-14370 primary hangs when setting FCV

DOCS-14370 primary hangs when setting FCV
Branch: v4.4
https://github.com/mongodb/docs/commit/fe69bde427af655d8eb4441b6bcbab079ea5d0e2

Comment by Githook User [ 12/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-14370 primary hangs when setting FCV

DOCS-14370 primary hangs when setting FCV
Branch: v5.0
https://github.com/mongodb/docs/commit/94591d187011161fca2a03916bbeb2ab6acbaa9a

Comment by Githook User [ 12/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-14370 primary hangs when setting FCV

DOCS-14370 primary hangs when setting FCV
Branch: master
https://github.com/mongodb/docs/commit/32b1c5f7d68f13bed3c92713f619221940145995

Generated at Thu Feb 08 08:10:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.