  1. Core Server
  2. SERVER-33639

Concurrent writes against non-existent database can fail due to distlock acquisition timeout at `createDatabase` time

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.6.3, 3.7.2
    • Fix Version/s: 3.6.6, 4.0.0-rc1, 4.1.1
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v4.0
    • Sprint:
      Sharding 2018-05-21, Sharding 2018-06-04
    • Linked BF Score:
      0

      Description

      Starting with MongoDB 3.6.0, the creation of sharded databases was made explicit from the point of view of MongoS, and the creation logic was moved to the config server. Since the default distributed lock acquisition timeout is still 20 seconds, this causes timeouts when a large number of threads suddenly try to write against a database that does not exist.

      What happens is a convoying effect on the -movePrimary distributed lock: waiters queue up behind one another, the acquisition times out, and writes fail even though the database has already been created. I am able to reproduce this problem 100% of the time using the load phase of the YCSB benchmark with 40 threads.
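
      To make the convoying effect concrete, here is a deliberately simplified model, not a description of the actual distributed lock implementation. It assumes all threads arrive at once, the lock is handed out in arrival order, and every holder keeps it for the same `hold_seconds`; that hold time is a hypothetical parameter, not a measured value. Only the 20-second timeout and the 40 threads come from this ticket.

```python
def waiters_timed_out(num_threads: int, hold_seconds: float,
                      timeout_seconds: float = 20.0) -> int:
    """Count threads that give up waiting for a lock served in arrival order.

    Simplifying assumptions (not taken from the server code): all threads
    arrive at time zero, and each holder keeps the lock for hold_seconds,
    so thread i (0-based) would have to wait i * hold_seconds.
    """
    return sum(
        1
        for i in range(num_threads)
        if i * hold_seconds > timeout_seconds
    )
```

      Under these assumptions, 40 threads with a hypothetical one-second hold time leave 19 threads past the 20-second timeout, while a half-second hold time leaves none. The point is only that the wait grows linearly with the number of queued writers, regardless of whether the database already exists by the time a waiter gets its turn.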

      To avoid this effect, we should first take some form of lock manager X lock, as the other metadata commands do, and then check whether the database exists before taking the distributed lock. Threads that find the database already created can return without ever queuing on the distributed lock, which mitigates the convoying effect.
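
      The proposed ordering (local exclusive lock, existence check, then distributed lock) can be sketched as follows. This is a toy illustration in Python, not the server's C++ code: `DatabaseCatalog`, `create_database`, and the two `threading.Lock` objects are hypothetical stand-ins for the config server's metadata, the lock manager X lock, and the distributed lock.

```python
import threading


class DatabaseCatalog:
    """Toy stand-in for the config server's database metadata (hypothetical)."""

    def __init__(self):
        self._databases = set()
        # Stand-in for the lock manager X lock: cheap, local, blocking.
        self._local_lock = threading.Lock()
        # Stand-in for the distributed lock: acquired with a timeout.
        self._dist_lock = threading.Lock()
        # Counts how often the expensive lock was actually taken.
        self.dist_lock_acquisitions = 0

    def create_database(self, name, dist_lock_timeout=20.0):
        """Create the database if needed; return True only if this call created it."""
        # Serialize local callers first so that only one of them can reach
        # the distributed lock at a time.
        with self._local_lock:
            # Existence check before the distributed lock: late arrivals
            # return here without queuing on the expensive lock.
            if name in self._databases:
                return False
            if not self._dist_lock.acquire(timeout=dist_lock_timeout):
                raise TimeoutError("timed out acquiring distributed lock")
            try:
                self.dist_lock_acquisitions += 1
                self._databases.add(name)
                return True
            finally:
                self._dist_lock.release()
```

      In this sketch, 40 threads calling `create_database("ycsb")` concurrently result in a single distributed lock acquisition; the remaining callers observe the database under the local lock and return immediately. A single local lock is used for brevity; keying it by database name would avoid serializing creations of unrelated databases.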

              People

              Assignee:
              janna.golden Janna Golden
              Reporter:
              kaloian.manassiev Kaloian Manassiev
              Votes:
              0
              Watchers:
              6
