Type: Bug
Resolution: Fixed
Priority: Major - P3
Component/s: Mongo Orchestration
Summary
Mongo-orchestration sometimes fails to create a sharded cluster, returning a 500 error. The underlying race seems to happen more frequently when auth is enabled and as the number of shards increases.
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/__init__.py", line 66, in wrap
    return f(*arg, **kwd)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/sharded_clusters.py", line 68, in sh_create
    result = _sh_create(data)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/sharded_clusters.py", line 44, in _sh_create
    cluster_id = ShardedClusters().create(params)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 542, in create
    cluster = ShardedCluster(params)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 105, in __init__
    f.result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 98, in add_shard
    info = self.member_add(cfg.get('id', None), shard_params)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 409, in member_add
    result = self._add(cfgs, member_id)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 386, in _add
    return self.router_command("addShard", (shard_uri, {"name": name}), is_eval=False)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 375, in router_command
    result = getattr(self.connection().admin, mode)(command, name, **d)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
    return func(self, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/database.py", line 809, in command
    return self._command(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/database.py", line 688, in _command
    return sock_info.command(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/pool.py", line 767, in command
    return command(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/network.py", line 166, in command
    helpers._check_command_response(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/helpers.py", line 181, in _check_command_response
    raise OperationFailure(errmsg, code, response, max_wire_version)
pymongo.errors.OperationFailure: Another addShard with different arguments is already running with different options, full error: {'ok': 0.0, 'errmsg': 'Another addShard with different arguments is already running with different options', 'code': 117, 'codeName': 'ConflictingOperationInProgress', '$clusterTime': {'clusterTime': Timestamp(1771891141, 61), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1771891141, 60)}
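The traceback shows the conflict: ShardedCluster.__init__ adds shards from a thread pool, and mongos rejects an addShard while a different addShard is still in flight (error code 117, ConflictingOperationInProgress). One possible mitigation is retrying on that specific code; the sketch below is hypothetical, not the shipped fix, and OperationFailure is stubbed so the example is self-contained (in mongo-orchestration it comes from pymongo.errors):

```python
import time

CONFLICTING_OPERATION_IN_PROGRESS = 117


class OperationFailure(Exception):
    """Stand-in for pymongo.errors.OperationFailure (stubbed for illustration)."""

    def __init__(self, errmsg, code):
        super().__init__(errmsg)
        self.code = code


def retry_on_conflict(command, attempts=5, delay=0.01):
    """Call `command` repeatedly, retrying only while mongos reports that a
    conflicting addShard is already running (code 117). Any other failure,
    or exhausting the attempts, re-raises immediately."""
    for attempt in range(attempts):
        try:
            return command()
        except OperationFailure as exc:
            if exc.code != CONFLICTING_OPERATION_IN_PROGRESS or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

This keeps the parallel shard setup but absorbs the transient conflict instead of surfacing it as a 500.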
An example command that can be used to trigger this race (if needed, bump up the number of shards):
curl -XPOST --data "{\"id\": \"shard-test0\", \"auth_key\": \"secret\", \"login\": \"admin\", \"password\": \"password\", \"shards\": [{\"id\": \"sh00\", \"shardParams\": {\"members\": [{}, {}, {}]}}, {\"id\": \"sh01\", \"shardParams\": {\"members\": [{}, {}, {}]}}, {\"id\": \"sh02\", \"shardParams\": {\"members\": [{}, {}, {}]}}, {\"id\": \"sh03\", \"shardParams\": {\"members\": [{}, {}, {}]}}, {\"id\": \"sh04\", \"shardParams\": {\"members\": [{}, {}, {}]}}], \"routers\": [{}]}" "http://localhost:8889/v1/sharded_clusters"
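The same request body can be generated programmatically when experimenting with shard counts. A small sketch using only the Python standard library; the field names mirror the curl payload above, and the shard-count parameter is ours:

```python
import json


def sharded_cluster_config(cluster_id, num_shards, members_per_shard=3):
    """Build a mongo-orchestration sharded-cluster request body matching the
    curl example above; raising num_shards makes the race more likely."""
    return {
        "id": cluster_id,
        "auth_key": "secret",
        "login": "admin",
        "password": "password",
        "shards": [
            {
                "id": "sh%02d" % i,
                "shardParams": {"members": [{} for _ in range(members_per_shard)]},
            }
            for i in range(num_shards)
        ],
        "routers": [{}],
    }


# Serialize for a POST to /v1/sharded_clusters (endpoint from the curl example).
payload = json.dumps(sharded_cluster_config("shard-test0", 5))
```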
Motivation
Who is the affected end user?
We use mongo-orchestration in our integration testing.
How does this affect the end user?
This causes sporadic failures on our CI.
How likely is it that this problem or use case will occur?
The race seems to occur about 1/20 runs on our CI.
If the problem does occur, what are the consequences and how severe are they?
Test failures.
Is this issue urgent?
No timeline; we have a retry workaround in place, but it is not ideal.
Is this ticket required by a downstream team?
TUNE.
Is this ticket only for tests?
Testing only.
Acceptance Criteria
Fix the race so that creating a sharded cluster no longer fails with a 500 caused by ConflictingOperationInProgress (code 117) when multiple shards are added concurrently, including with auth enabled.