[SERVER-67348] Fix race condition on set_cluster_parameter.js Created: 17/Jun/22  Updated: 29/Oct/23  Resolved: 08/Jul/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 6.1.0-rc0

Type: Task Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Marcos José Grillo Ramirez
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File 0001-repro-SERVER-67348.patch    
Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Sprint: Sharding EMEA 2022-07-11
Participants:
Linked BF Score: 11

 Description   

set_cluster_parameter.js tests that the following two actions are serialized:

  • SetClusterParameterCoordinator's propagation of the new parameter value to all shards
  • addShard

To check that, the test uses the 'failCommand' failpoint with an 'errorCode' of PrimarySteppedDown, which the coordinator retries on. However, the coordinator only retries the _shardsvrSetClusterParameter command a fixed number of times. Once the attempts are exhausted, the coordinator releases the lock that was ensuring the serialization and retries again later.

This behavior is fine in itself, but it makes the test racy: the test can issue the addShard command at a moment when the coordinator, having exhausted its retries, is not holding the lock.
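
The window described above can be illustrated with a small stand-alone model. This is only a sketch in plain JavaScript, not MongoDB code: the retry budget MAX_ATTEMPTS, the coordinator object, and the helpers stepCoordinator and tryAddShard are all hypothetical names invented for the illustration, and the real retry count is not stated in this ticket.

```javascript
// Toy model of the race; none of these names exist in the server code.
const MAX_ATTEMPTS = 3;  // assumed retry budget, chosen only for the example

function makeCoordinator() {
    return {lockHeld: false, attempts: 0, done: false};
}

// One scheduling step of the modeled coordinator. sendToShards() returns true
// on success and false when the command fails with a retryable error.
function stepCoordinator(coord, sendToShards) {
    if (coord.done) {
        return;
    }
    if (!coord.lockHeld) {
        coord.lockHeld = true;  // (re)acquire the serialization lock
        coord.attempts = 0;
    }
    if (sendToShards()) {
        coord.done = true;
        coord.lockHeld = false;
        return;
    }
    coord.attempts++;
    if (coord.attempts >= MAX_ATTEMPTS) {
        coord.lockHeld = false;  // retries exhausted: release the lock, retry later
    }
}

// In this model, addShard can only proceed while the lock is free.
function tryAddShard(coord) {
    return !coord.lockHeld;
}

const coord = makeCoordinator();
const alwaysFail = () => false;  // stands in for failCommand + PrimarySteppedDown
const observations = [];
for (let i = 0; i < MAX_ATTEMPTS; i++) {
    stepCoordinator(coord, alwaysFail);
    observations.push(tryAddShard(coord));
}
console.log(observations);  // [ false, false, true ]
```

In the model, addShard is blocked during the first two failed attempts but gets through after the third, even though the parameter was never propagated. A test that asserts addShard is always blocked while the failpoint is active would fail intermittently for the same reason.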



 Comments   
Comment by Githook User [ 08/Jul/22 ]

Author:

{'name': 'Marcos José Grillo Ramirez', 'email': 'marcos.grillo@mongodb.com', 'username': 'm4nti5'}

Message: SERVER-67348 Use hangInShardsvrSetClusterParameter failpoint instead of making the command fail with a retryable error in set_cluster_parameters.js
Branch: master
https://github.com/mongodb/mongo/commit/e362ae89a375ad88e55bff7b55fbf08ba472e6e2

Comment by Jordi Serra Torrens [ 17/Jun/22 ]

Attached repro.

A simple solution is to mimic what jstests/sharding/set_user_write_block_mode.js does: it uses a failpoint to block a command, instead of relying on the 'failCommand' failpoint to make the _shardsvrSetClusterParameter command fail with a retryable error.
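
Why a blocking failpoint closes the window can be sketched with the same kind of stand-alone model. Again this is plain JavaScript rather than server or jstest code, and the failpoint object, stepCoordinator, and tryAddShard are hypothetical names for the illustration. The assumption is that a hung command never returns an error, so the coordinator never reaches its retry limit and never drops the lock.

```javascript
// Toy model of the "hang" approach; none of these names exist in the server code.
function makeCoordinator() {
    return {lockHeld: false, done: false};
}

// One scheduling step of the modeled coordinator. While the failpoint is
// enabled, the command is parked and the lock stays held.
function stepCoordinator(coord, failpoint) {
    if (coord.done) {
        return;
    }
    coord.lockHeld = true;  // lock is taken before contacting the shards
    if (failpoint.enabled) {
        failpoint.timesEntered++;  // command hangs; no error, so no retry accounting
        return;
    }
    coord.done = true;  // command completed on all shards
    coord.lockHeld = false;
}

// In this model, addShard can only proceed while the lock is free.
function tryAddShard(coord) {
    return !coord.lockHeld;
}

const failpoint = {enabled: true, timesEntered: 0};
const coord = makeCoordinator();
const whileHung = [];
for (let i = 0; i < 5; i++) {
    stepCoordinator(coord, failpoint);
    whileHung.push(tryAddShard(coord));
}

failpoint.enabled = false;  // the test turns the failpoint off
stepCoordinator(coord, failpoint);
const afterRelease = tryAddShard(coord);

console.log(whileHung, afterRelease);  // [ false, false, false, false, false ] true
```

However many times the modeled coordinator is scheduled while the failpoint is on, addShard stays blocked; it only succeeds once the failpoint is turned off and the propagation completes. That makes the serialization check deterministic instead of timing dependent.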

Generated at Thu Feb 08 06:07:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.