Core Server / SERVER-64641

Deadlock invariant tripped in shard split SingleServerDiscoveryMonitor

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.0.0-rc0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Fully Compatible
    • ALL
    • 140

      The issue appears to stem from improper shutdown of TopologyEventsPublisher/SingleServerDiscoveryMonitor. Following logic similar to that of StreamableReplicaSetMonitor should let us shut these components down gracefully.
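
One common way to make this kind of shutdown safe is to flip a shutdown flag under the monitor's own mutex, release that mutex, and only then call into the publisher. The following is a minimal sketch of that pattern, not the actual fix and not the logic of StreamableReplicaSetMonitor: SketchPublisher, SketchMonitor, onError, and every member name are hypothetical stand-ins.

```cpp
#include <cassert>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical publisher guarded by its own queue mutex.
// This is NOT the real TopologyEventsPublisher.
class SketchPublisher {
public:
    void publish(const std::string& event) {
        std::lock_guard<std::mutex> lk(_queueMutex);
        if (!_closed) {
            _events.push_back(event);
        }
    }

    // After close(), further events are dropped.
    void close() {
        std::lock_guard<std::mutex> lk(_queueMutex);
        _closed = true;
    }

    std::vector<std::string> events() {
        std::lock_guard<std::mutex> lk(_queueMutex);
        return _events;
    }

private:
    std::mutex _queueMutex;
    bool _closed = false;
    std::vector<std::string> _events;
};

// Hypothetical monitor; NOT the real SingleServerDiscoveryMonitor.
class SketchMonitor {
public:
    explicit SketchMonitor(SketchPublisher& publisher) : _publisher(publisher) {}

    void onError(const std::string& error) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            if (_isShutdown) {
                return;  // Ignore late responses once shutdown has begun.
            }
        }  // The monitor mutex is released here, before touching the publisher.
        _publisher.publish(error);
    }

    void shutdown() {
        std::lock_guard<std::mutex> lk(_mutex);
        _isShutdown = true;  // Only state owned by the monitor changes under its mutex.
    }

private:
    SketchPublisher& _publisher;
    std::mutex _mutex;
    bool _isShutdown = false;
};
```

In this sketch the two mutexes are never held at the same time, so no ordering between them can be violated. A shutdown sequence would call shutdown() on the monitor first and close() on the publisher afterwards.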

       

      From this build failure:
      https://spruce.mongodb.com/task/mongodb_mongo_master_enterprise_rhel_80_64_bit_dynamic_all_feature_flags_required_serverless_patch_b1dc7f546a006efa5edf063286e4368ca603fe48_62349a030ae6061771e6d5da_22_03_18_14_41_15/tests?execution=0&sortBy=STATUS&sortDir=ASC

      [js_test:shard_split_basic_test] d20270| 2022-03-18T15:16:15.247+00:00 I  -        4333222 [ShardSplitDonorService-3] "RSM received error response","attr":{"host":"ip-10-122-17-171.ec2.internal:20275","error":"ShutdownInProgress: Shutdown in progress","replicaSet":"","response":{}}
      [js_test:shard_split_basic_test] d20270| 2022-03-18T15:16:15.247+00:00 F  -        5106800 [ShardSplitDonorService-3] "Theoretical deadlock found on use of latch","attr":{"reason":"Latch acquired after other latch of lower level","latch":{"name":"TopologyEventsPublisher::_eventQueueMutex","latchId":11855,"level":6,"file":"src/mongo/client/sdam/topology_listener.h","line":99},"latchesHeld":[{"name":"SingleServerDiscoveryMonitor::mutex","latchId":11858,"level":4,"file":"src/mongo/client/server_discovery_monitor.cpp","line":85}]}
      [js_test:shard_split_basic_test] d20270| 2022-03-18T15:16:15.247+00:00 F  ASSERT   23089   [ShardSplitDonorService-3] "Fatal assertion","attr":{"msgid":5106800,"file":"src/mongo/util/latch_analyzer.cpp","line":229}
      [js_test:shard_split_basic_test] d20270| 2022-03-18T15:16:15.247+00:00 F  ASSERT   23090   [ShardSplitDonorService-3] "\n\n***aborting after fassert() failure\n\n"
      [js_test:shard_split_basic_test] d20270| 2022-03-18T15:16:15.247+00:00 F  CONTROL  4757800 [ShardSplitDonorService-3] "Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}
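
The fatal message above reports that a level 6 latch (TopologyEventsPublisher::_eventQueueMutex) was acquired while a level 4 latch (SingleServerDiscoveryMonitor::mutex) was already held. As a rough illustration of that ordering rule, here is a minimal sketch of a per-thread level check. It is not the code in latch_analyzer.cpp: LatchInfo and LevelChecker are hypothetical names, and the handling of equal levels is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical description of a latch and its hierarchy level.
struct LatchInfo {
    std::string name;
    int level;
};

// Tracks the latches held by one thread and checks acquisition order.
class LevelChecker {
public:
    // Returns true when no currently held latch has a lower level than `next`.
    // Equal levels are allowed here; that choice is an assumption of this sketch.
    bool canAcquire(const LatchInfo& next) const {
        return std::none_of(_held.begin(), _held.end(), [&](const LatchInfo& held) {
            return held.level < next.level;
        });
    }

    void acquire(const LatchInfo& latch) {
        _held.push_back(latch);
    }

    // Removes the first held latch with the given name, if any.
    void release(const std::string& name) {
        auto it = std::find_if(_held.begin(), _held.end(), [&](const LatchInfo& held) {
            return held.name == name;
        });
        if (it != _held.end()) {
            _held.erase(it);
        }
    }

private:
    std::vector<LatchInfo> _held;
};
```

With the levels from the log, holding the level 4 latch makes canAcquire return false for the level 6 latch, which corresponds to the "Latch acquired after other latch of lower level" condition. Taking the level 6 latch first, or releasing the level 4 latch before taking it, passes the check.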
      

            Assignee:
            didier.nadeau@mongodb.com Didier Nadeau
            Reporter:
            matt.broadstone@mongodb.com Matt Broadstone
            Votes:
            0
            Watchers:
            3

              Created:
              Updated:
              Resolved: