Core Server / SERVER-41584

Deadlock shutting down the PeriodicRunner while the LogicalSessionCache is refreshing

      There is a deadlock in the following scenario:

      1. The secondaries of the config server replica set are shut down cleanly.
      2. The test runner then attempts to shut down the config server primary (mongod, process 28261; see stack trace above), and the shutdown process is initiated.
      3. The LogicalSessionCache then begins a refresh, which tries to create config.system.sessions. This sends ShardsvrShardCollection to one of the shards.
      4. The shard then sends ConfigsvrCreateCollection back to the config server primary, which tries to take the database dist lock and hangs in ReplicationCoordinatorImpl::waitUntilOpTimeForRead: the secondaries are down, so the optime never advances.
      5. Meanwhile, the shutdown thread proceeds to shut down the PeriodicRunner. The LogicalSessionCacheRefresh job runs inside the PeriodicRunner and is hung, so the PeriodicRunner can never shut down.
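
      The shape of the hang can be reproduced with a toy model. The sketch below is not MongoDB code: the PeriodicRunner class, refresh_job, and the optime_advanced event are hypothetical stand-ins, and the join uses a short timeout only so that the demo terminates instead of hanging.

```python
import threading


class PeriodicRunner:
    """Toy stand-in for a job runner; NOT MongoDB's PeriodicRunner."""

    def __init__(self, job):
        # Daemon thread so the demo process can exit even though the job hangs.
        self._thread = threading.Thread(target=job, daemon=True)

    def start(self):
        self._thread.start()

    def shutdown(self, join_timeout=None):
        # Shutdown and join are one step here. With join_timeout=None this
        # call blocks for as long as the job does.
        self._thread.join(join_timeout)
        return not self._thread.is_alive()


def demo_hang(join_timeout=0.2):
    # Never set: models an optime that cannot advance because the
    # secondaries are already down.
    optime_advanced = threading.Event()

    def refresh_job():
        # Models the refresh blocking in an unbounded wait for the optime.
        optime_advanced.wait()

    runner = PeriodicRunner(refresh_job)
    runner.start()
    # Returns False: the runner could not finish shutting down in time.
    return runner.shutdown(join_timeout)
```

      In this model, calling shutdown() without a timeout would block forever, which corresponds to step 5.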

      Potential fixes include:

      1. Separate PeriodicRunner shutdown from join, so that we can kill all operations after shutting down the runner and then join jobs after that
      2. Put a timeout on the read for the distlock acquisition
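
      Both options can be sketched in the same kind of toy model. Everything below is an assumption made for illustration (the Operation class, OperationKilled, the polling interval, and the method names); it shows the ordering of option 1 (stop, kill, join) and the bounded wait of option 2, not how the server implements either.

```python
import threading
import time


class OperationKilled(Exception):
    """Hypothetical error raised when an operation is interrupted."""


class Operation:
    """Hypothetical interruptible operation; not a MongoDB type."""

    def __init__(self):
        self._killed = threading.Event()
        self._optime_advanced = threading.Event()  # never set in this demo

    def kill(self):
        self._killed.set()

    def wait_for_optime(self, timeout=None):
        # Poll so that both a kill and a deadline (option 2) are noticed.
        deadline = None if timeout is None else time.monotonic() + timeout
        while not self._optime_advanced.is_set():
            if self._killed.wait(0.01):
                raise OperationKilled("interrupted at shutdown")
            if deadline is not None and time.monotonic() >= deadline:
                raise TimeoutError("optime did not advance in time")


class PeriodicRunner:
    """Toy runner with shutdown separated from join (option 1)."""

    def __init__(self, job):
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, args=(job,), daemon=True)

    def _run(self, job):
        while not self._stop.is_set():
            job()
            self._stop.wait(0.01)

    def start(self):
        self._thread.start()

    def shutdown(self):
        # Non-blocking: only prevents further runs from being scheduled.
        self._stop.set()

    def join(self, timeout=None):
        self._thread.join(timeout)
        return not self._thread.is_alive()


def demo_split_shutdown():
    op = Operation()
    started = threading.Event()
    outcome = []

    def refresh_job():
        started.set()
        try:
            op.wait_for_optime()  # unbounded wait, as in the reported hang
        except OperationKilled:
            outcome.append("killed")

    runner = PeriodicRunner(refresh_job)
    runner.start()
    started.wait(2.0)        # make sure the job is already blocked
    runner.shutdown()        # 1. stop the runner without joining
    op.kill()                # 2. kill outstanding operations
    joined = runner.join(timeout=2.0)  # 3. join the jobs afterwards
    return joined, outcome


def demo_timeout():
    # Option 2: a bounded wait fails instead of hanging.
    try:
        Operation().wait_for_optime(timeout=0.05)
    except TimeoutError:
        return "timed out"
    return "advanced"
```

      In the first demo the join succeeds because the blocked job is interrupted before the join starts. In the second, the wait gives up on its own, so nothing needs to interrupt it.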

            Assignee:
            amirsaman.memaripour@mongodb.com Amirsaman Memaripour
            Reporter:
            matthew.saltz@mongodb.com Matthew Saltz (Inactive)
            Votes:
            0
            Watchers:
            6