mongo-orchestration racy failure when creating sharded clusters


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Component/s: Mongo Orchestration

      Summary

      Mongo-orchestration sometimes fails to create a sharded cluster, returning a 500 error. The race appears to happen more frequently when auth is enabled and as the number of shards increases.

      Traceback (most recent call last):
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/__init__.py", line 66, in wrap
          return f(*arg, **kwd)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/sharded_clusters.py", line 68, in sh_create
          result = _sh_create(data)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/apps/sharded_clusters.py", line 44, in _sh_create
          cluster_id = ShardedClusters().create(params)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 542, in create
          cluster = ShardedCluster(params)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 105, in __init__
          f.result()
        File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
          return self.__get_result()
        File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
          raise self._exception
        File "/usr/lib/python3.10/concurrent/futures/thread.py", line 58, in run
          result = self.fn(*self.args, **self.kwargs)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 98, in add_shard
          info = self.member_add(cfg.get('id', None), shard_params)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 409, in member_add
          result = self._add(cfgs, member_id)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 386, in _add
          return self.router_command("addShard", (shard_uri, {"name": name}), is_eval=False)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/mongo_orchestration/sharded_clusters.py", line 375, in router_command
          result = getattr(self.connection().admin, mode)(command, name, **d)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/_csot.py", line 105, in csot_wrapper
          return func(self, *args, **kwargs)
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/database.py", line 809, in command
          return self._command(
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/database.py", line 688, in _command
          return sock_info.command(
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/pool.py", line 767, in command
          return command(
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/network.py", line 166, in command
          helpers._check_command_response(
        File "/home/ubuntu/.local/lib/python3.10/site-packages/pymongo/helpers.py", line 181, in _check_command_response
          raise OperationFailure(errmsg, code, response, max_wire_version)
      pymongo.errors.OperationFailure: Another addShard with different arguments is already running with different options, full error: {'ok': 0.0, 'errmsg': 'Another addShard with different arguments is already running with different options', 'code': 117, 'codeName': 'ConflictingOperationInProgress', '$clusterTime': {'clusterTime': Timestamp(1771891141, 61), 'signature': {'hash': b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'keyId': 0}}, 'operationTime': Timestamp(1771891141, 60)} 
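
      The failing call originates in ShardedCluster.__init__, which submits add_shard jobs to a thread pool, so multiple addShard commands can reach mongos concurrently; mongos rejects the overlapping ones with code 117 (ConflictingOperationInProgress). A minimal sketch of a retry around that call, assuming a hypothetical helper (this is not code from mongo-orchestration itself; the backoff parameters are made up):

      ```python
      import time

      try:
          from pymongo.errors import OperationFailure
      except ImportError:  # allow running this sketch without pymongo installed
          class OperationFailure(Exception):
              def __init__(self, errmsg, code=None, details=None, max_wire_version=None):
                  super().__init__(errmsg)
                  self.code = code

      # Error code mongos returns for overlapping addShard commands.
      CONFLICTING_OPERATION_IN_PROGRESS = 117

      def add_shard_with_retry(admin_db, shard_uri, name, attempts=5, delay=0.5):
          """Run addShard, retrying while a concurrent addShard is in progress.

          Hypothetical workaround sketch: back off and retry on code 117
          instead of failing the whole cluster creation with a 500.
          """
          for attempt in range(attempts):
              try:
                  return admin_db.command("addShard", shard_uri, name=name)
              except OperationFailure as exc:
                  if exc.code != CONFLICTING_OPERATION_IN_PROGRESS or attempt == attempts - 1:
                      raise
                  time.sleep(delay * (attempt + 1))
      ```

      An alternative fix would be to serialize the addShard calls (only the replica set startup needs to happen in parallel), which avoids the retry entirely.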

      An example command that can trigger this race (bump up the number of shards if needed):

      curl -XPOST --data '{"id": "shard-test0", "auth_key": "secret", "login": "admin", "password": "password", "shards": [{"id": "sh00", "shardParams": {"members": [{}, {}, {}]}}, {"id": "sh01", "shardParams": {"members": [{}, {}, {}]}}, {"id": "sh02", "shardParams": {"members": [{}, {}, {}]}}, {"id": "sh03", "shardParams": {"members": [{}, {}, {}]}}, {"id": "sh04", "shardParams": {"members": [{}, {}, {}]}}], "routers": [{}]}' "http://localhost:8889/v1/sharded_clusters"

      Motivation

      Who is the affected end user?

      We use mongo-orchestration in our integration testing.

      How does this affect the end user?

      This causes sporadic failures on our CI.

      How likely is it that this problem or use case will occur?

      The race seems to occur about 1/20 runs on our CI.

      If the problem does occur, what are the consequences and how severe are they?

      Test failures.

      Is this issue urgent?

      No timeline; we have a retry workaround in place, but it is not ideal.
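
      Our workaround is roughly the following: retry the whole POST to /v1/sharded_clusters when mongo-orchestration returns a 500, since that may just mean the addShard race was hit. A standard-library-only sketch (the function name, retry counts, and delays are illustrative, not our exact code):

      ```python
      import json
      import time
      import urllib.error
      import urllib.request

      def create_cluster_with_retry(base_url, payload, attempts=3, delay=2.0):
          """POST a sharded-cluster config, retrying on HTTP 500.

          Client-side workaround sketch: a 500 here may be the transient
          ConflictingOperationInProgress race, so retry before giving up.
          """
          data = json.dumps(payload).encode("utf-8")
          for attempt in range(attempts):
              req = urllib.request.Request(
                  base_url + "/v1/sharded_clusters",
                  data=data,
                  headers={"Content-Type": "application/json"},
              )
              try:
                  with urllib.request.urlopen(req) as resp:
                      return json.load(resp)
              except urllib.error.HTTPError as exc:
                  if exc.code != 500 or attempt == attempts - 1:
                      raise
                  time.sleep(delay)
      ```

      This is not ideal because a genuine misconfiguration also surfaces as a 500, so real failures are reported only after the retries are exhausted.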

      Is this ticket required by a downstream team?

      TUNE.

      Is this ticket only for tests?

      Testing only.

      Acceptance Criteria

      Sharded cluster creation succeeds reliably; concurrent addShard calls no longer cause ConflictingOperationInProgress (code 117) failures that surface as 500 errors.

            Assignee: Shane Harvey
            Reporter: James Stone
            Votes: 0
            Watchers: 2