Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.0-rc2, 5.1.0-rc0
Affects Version/s: None
Component/s: Replication
Labels:
None

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Sprint:
Repl 2021-06-14
Linked BF Score:
18
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We see the following sequence occur when running rollback_set_fcv.js:

It first successfully runs a test which ends with the replica set in version 4.9.
Then, in the second test, the primary first sets the FCV of the entire replica set to 4.4. Since this is a downgrade from 4.9, we issue a reconfig, which waits to be propagated to a majority of the nodes.

We see a log message indicating that it has been propagated to a majority of nodes:

[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  -        5042304 [conn2] "The downgraded replica set config has been propagated to a majority"

It also seems like from heartbeat requests, all nodes have actually updated to this new config of configVersion: 6, configTerm: 2.

[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.003+00:00 D2 REPL_HB  24095   [conn20] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20772","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20772","fromId":2,"term":2,"primaryId":-1,"maxTimeMSOpOnly":10000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"7OtpxE//rC4QyFMzAfGz+yomYA4=","subType":"0"}},"keyId":6937200231259832000}},"$db":"admin"}}
[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:06.140+00:00 D2 REPL_HB  24095   [conn5] "Received heartbeat request","attr":{"from":"ip-10-122-51-241.ec2.internal:20770","cmdObj":{"replSetHeartbeat":"rollback_set_fcv","configVersion":6,"configTerm":2,"hbv":1,"from":"ip-10-122-51-241.ec2.internal:20770","fromId":0,"term":2,"maxTimeMSOpOnly":10000,"$replData":1,"$db":"admin"}}
[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.492+00:00 I  REPL     4508701 [conn2] "The current replica set config is committed","attr":{"configVersion":6,"configTerm":2,"configWaitDurationMillis":2009,"oplogWaitDurationMillis":0,"configOplogCommitmentOpTime":{"ts":{"$timestamp":{"t":1615192804,"i":1}},"t":2}}

However, on the primary, we only receive heartbeat responses with the new config of configVersion: 6, configTerm: 2 from one of the secondaries:

[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:05.658+00:00 D2 REPL_HB  4615620 [ReplCoord-6] "Received response to heartbeat","attr":{"requestId":125,"target":"ip-10-122-51-241.ec2.internal:20770","response":{"ok":1,"state":2,"v":5,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:05.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:05.459Z"},"electable":true,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"configVersion":5,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192805,"i":1}}}}


[js_test:rollback_set_fcv] d20781| 2021-03-08T08:40:07.491+00:00 D2 REPL_HB  4615620 [ReplCoord-2] "Received response to heartbeat","attr":{"requestId":127,"target":"ip-10-122-51-241.ec2.internal:20772","response":{"ok":1,"state":2,"v":6,"configTerm":2,"set":"rollback_set_fcv","syncingTo":"ip-10-122-51-241.ec2.internal:20771","term":2,"primaryId":1,"durableOpTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"durableWallTime":{"$date":"2021-03-08T08:40:01.459Z"},"opTime":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"wallTime":{"$date":"2021-03-08T08:40:01.459Z"},"electable":false,"$replData":{"term":2,"lastOpCommitted":{"ts":{"$timestamp":{"t":1615192805,"i":1}},"t":2},"lastCommittedWall":{"$date":"2021-03-08T08:40:05.459Z"},"lastOpVisible":{"ts":{"$timestamp":{"t":1615192801,"i":2}},"t":2},"configVersion":6,"configTerm":2,"replicaSetId":{"$oid":"6045e2dbfd9e9af77a0e716b"},"syncSourceIndex":1,"isPrimary":false},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1615192805,"i":1}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1615192801,"i":2}}}}

After this, the primary is network partitioned from the other nodes, and attempts to set its FCV to the target version of 4.9.
In order to do this, it must first wait for the current replica set config (configVersion: 6, configTerm: 2) to propagate to all nodes. It seems like the primary never updated its local view of one of the secondaries (node with port 20770), so it still thinks it has an old config. Since the primary is partitioned, it is never able to update its view of the other nodes' configs, which causes the hang in awaitReplication, so it is never able to set its FCV to the target version of 4.9.

One possible solution could be to make sure the initial downgrade reconfig is propagated to all of the nodes in the replica set, rather than just the majority. This way, the primary's view of the other nodes' configs would be up to date before trying to set its FCV.

Assignee:: Huayu Ouyang
Reporter:: Huayu Ouyang
Participants:: Githook User, Huayu Ouyang, Vivian Ge
Votes:: 0 Vote for this issue
Watchers:: 6 Start watching this issue

Created:: Apr 09 2021 07:32:45 PM UTC
Updated:: Oct 29 2023 09:55:04 PM UTC
Resolved:: Jun 10 2021 08:56:19 PM UTC

Details

Description

Attachments

Activity

People

Dates