[SERVER-46894] Wait for the current config to be committed before running reconfig Created: 16/Mar/20  Updated: 29/Oct/23  Resolved: 04/Apr/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Improvement Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-13599 Investigate changes in SERVER-46894: ... Closed
Related
is related to SERVER-45087 Check Oplog Commitment condition on r... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.4
Sprint: Repl 2020-03-23, Repl 2020-04-06
Participants:
Linked BF Score: 0

 Description   

Currently, we wait for both Config Replication and Oplog Commitment at the end of the reconfig command. The former guarantees the new oplog cannot be "rolled back"; the latter guarantees the following config can be accepted. However, it seems better to wait for the latter only when it is needed, that is, on receiving the following config.

As an example, adding one node with votes: 1 to a single-node replica set currently forces the reconfig command to wait for the new node's initial sync to finish before it returns.

Another case arises right after an election: Oplog Commitment requires the first optime in the new term to be committed, so a reconfig issued immediately after the election may fail. Waiting for that first optime to be committed lets the command finish successfully.
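
To make the single-node example concrete, here is a minimal Python sketch of the two conditions. It is a simplified model, not the server's implementation: the `Member` class, the helper names, and the use of a plain integer as an optime are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Member:
    votes: int                # 0 for a non-voting member
    config: Tuple[int, int]   # (version, term) of the config this member has installed
    applied_optime: int       # assumption: optime simplified to one increasing integer


def _voters(members: List[Member]) -> List[Member]:
    return [m for m in members if m.votes > 0]


def _is_majority(count: int, total: int) -> bool:
    return count >= total // 2 + 1


def config_replicated(members: List[Member], config: Tuple[int, int]) -> bool:
    """Config Replication: a majority of voting members have installed `config`."""
    voters = _voters(members)
    have = sum(1 for m in voters if m.config == config)
    return _is_majority(have, len(voters))


def oplog_committed(members: List[Member], optime: int) -> bool:
    """Oplog Commitment: a majority of voting members have applied `optime`."""
    voters = _voters(members)
    have = sum(1 for m in voters if m.applied_optime >= optime)
    return _is_majority(have, len(voters))


# A single-node set that has just added a second voting node.
primary = Member(votes=1, config=(2, 1), applied_optime=100)
new_node = Member(votes=1, config=(2, 1), applied_optime=0)  # still in initial sync
members = [primary, new_node]
```

In this model, `config_replicated(members, (2, 1))` holds as soon as the new node has installed the config, while `oplog_committed(members, 100)` stays false until the new node catches up. Waiting only for the former at the end of reconfig is what lets the command return before initial sync completes.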



 Comments   
Comment by Githook User [ 05/Apr/20 ]

Author:

{'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}

Message: SERVER-46894 Wait for the current config to be committed before running reconfig.

This also changes the behavior of waiting for both Config Replication and
Oplog Commitment proactively at the end of safe reconfig. Both conditions guarantee
the safety of the next reconfig. Now we only wait for Config Replication at the
end of safe reconfig and wait for both conditions at the beginning of the
next reconfig when the waiting is necessary.

(cherry picked from commit 89ec7322a58686b89aa71f26b1f050ded94cf949)
Branch: v4.4
https://github.com/mongodb/mongo/commit/0916dcd9d48917c520553bfab45fcfe5ead85e52

Comment by Githook User [ 04/Apr/20 ]

Author:

{'name': 'Siyuan Zhou', 'email': 'siyuan.zhou@mongodb.com', 'username': 'visualzhou'}

Message: SERVER-46894 Wait for the current config to be committed before running reconfig.

This also changes the behavior of waiting for both Config Replication and
Oplog Commitment proactively at the end of safe reconfig. Both conditions guarantee
the safety of the next reconfig. Now we only wait for Config Replication at the
end of safe reconfig and wait for both conditions at the beginning of the
next reconfig when the waiting is necessary.
Branch: master
https://github.com/mongodb/mongo/commit/89ec7322a58686b89aa71f26b1f050ded94cf949

Comment by Siyuan Zhou [ 19/Mar/20 ]

If the primary times out (after maxTimeMS) while waiting for Config Replication or Oplog Commitment of the current config, it leaves the current config unchanged and returns a new error code “CurrentConfigNotCommittedYet”, with one of the following example error messages:

  • New config is rejected since the current config with {version: 8, term: 3} has not yet propagated to a majority of voting nodes within maxTimeMS: <timeout>.
  • New config is rejected since the last committed optime in the previous config (<OpTime>) has not yet become committed in the current config with {version: 8, term: 3} within maxTimeMS: <timeout>.

If the primary times out while waiting for Config Replication of the new config at the end of reconfig, it returns the “ExceededTimeLimit” error code with the following message:

  • Reconfig with {version: 8, term: 3} finished but did not propagate to a majority of voting nodes within maxTimeMS: <timeout>.
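
The timeout behavior described in this comment can be sketched as a small Python model. This is not the server's code: `safe_reconfig`, `wait_until`, the callback parameters, and the use of a single shared deadline for all waits are assumptions made for illustration. Only the two error code names come from the comment.

```python
import time
from typing import Callable


def wait_until(cond: Callable[[], bool], deadline: float, poll_s: float = 0.005) -> bool:
    """Poll `cond` until it returns True or the monotonic-clock deadline passes."""
    while True:
        if cond():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


def safe_reconfig(
    current_config_replicated: Callable[[], bool],
    oplog_committed: Callable[[], bool],
    install_new_config: Callable[[], None],
    new_config_replicated: Callable[[], bool],
    max_time_ms: int,
) -> str:
    """Return "ok" or the name of the error code the comment describes."""
    # Assumption: one deadline derived from maxTimeMS covers every wait.
    deadline = time.monotonic() + max_time_ms / 1000.0

    # Beginning of reconfig: the current config must be committed first.
    # On timeout nothing has been installed, so the current config is unchanged.
    if not wait_until(current_config_replicated, deadline):
        return "CurrentConfigNotCommittedYet"
    if not wait_until(oplog_committed, deadline):
        return "CurrentConfigNotCommittedYet"

    install_new_config()

    # End of reconfig: wait only for Config Replication of the new config.
    # On timeout the new config is already installed on the primary.
    if not wait_until(new_config_replicated, deadline):
        return "ExceededTimeLimit"
    return "ok"
```

The two outcomes differ in what a caller can assume: with “CurrentConfigNotCommittedYet” the new config was never installed, whereas with “ExceededTimeLimit” the reconfig finished on the primary and only its propagation is unconfirmed.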

Comment by Siyuan Zhou [ 19/Mar/20 ]

Quote from evin.roesle: "We would like to show the log message of waiting by default so that users have the ability to understand that it is hung without actually installing the reconfig." This ticket needs to make sure the log message is shown by default.
