Core Server / SERVER-22620

Improve mongos handling of a very stale secondary config server

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: 3.3.11
    • Affects Version/s: 3.2.1
    • Component/s: Replication, Sharding
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Steps To Reproduce:
      1. Create a sharded cluster (1 shard is enough)
      2. Make sure you have a replica set of config servers with 3 nodes
      3. Shard a collection
      4. Connect to one of the config server secondaries and lock it with the "db.fsyncLock()" command (see the sketch at the end of this section)
      5. Try to shard another collection and observe that the mongos times out. For example:
        mongos> sh.shardCollection("test.testcol2", {a:1})
        { "ok" : 0, "errmsg" : "Operation timed out", "code" : 50 }
        

        The balancer is also failing to work:

        2016-02-16T02:30:35.208+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "dmnx-6-2016-02-16T02:30:35.208+0000-56c289cbaee7ebe12efbfaee", server: "dmnx-6", clientAddr: "", time: new Date(1455589835208), what: "balancer.round", ns: "", details: { executionTimeMillis: 30016, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
        2016-02-16T02:31:15.244+0000 W SHARDING [Balancer] ExceededTimeLimit Operation timed out
        
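      For step 4, a minimal sketch of locking one config server secondary (the replica set name "configRS" and the host cfg2.example.net:27019 are placeholder assumptions):

        $ mongo --host cfg2.example.net --port 27019
        configRS:SECONDARY> db.fsyncLock()   // blocks writes on this node until db.fsyncUnlock() is run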
    • Sprint: Sharding 18 (08/05/16), Sharding 2016-08-29

      This ticket is to improve how sharding handles very stale secondary config servers (although the improvement would apply to shards as well). The proposed solution is for the isMaster response to include the latest optime the node has replicated, so that the replica set monitor, in addition to selecting 'nearer' hosts, will also prefer those with the most recent optimes.
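      A hedged sketch of that selection logic follows; it is not the server's actual implementation, and the host fields (pingMs, opTimeSecs) and the 90-second staleness cutoff are illustrative assumptions:

        // Given candidate hosts with a measured latency and a last-replicated
        // optime, drop hosts that are too stale, then keep the existing
        // 'nearest' behavior among the remaining ones.
        function selectHost(candidates) {
            var newestOpTime = Math.max.apply(null, candidates.map(function (h) {
                return h.opTimeSecs;
            }));
            var maxStalenessSecs = 90;  // assumed cutoff, not a documented value
            var fresh = candidates.filter(function (h) {
                return newestOpTime - h.opTimeSecs <= maxStalenessSecs;
            });
            if (fresh.length === 0) return null;  // every host is too stale
            return fresh.sort(function (a, b) { return a.pingMs - b.pingMs; })[0];
        }

      For example, selectHost([{ pingMs: 2, opTimeSecs: 1000 }, { pingMs: 1, opTimeSecs: 5 }]) would skip the nearer but badly lagged second host, which is exactly the fsyncLocked-secondary case described below.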

      This problem also occurs with fsyncLocked secondaries. It seems that mongos is unable to work properly if one of the config server replica set secondaries is locked with db.fsyncLock(). I have tried running some write concern / read concern operations directly against the replica set while a secondary is locked that way and found no problem, so the issue must lie in mongos alone.
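      A sketch of that direct test (hostnames and the wtimeout value are placeholders), run while one of the three config server secondaries is fsyncLocked:

        $ mongo --host configRS/cfg1.example.net:27019
        configRS:PRIMARY> db.getSiblingDB("test").c.insert(
        ...     { x: 1 },
        ...     { writeConcern: { w: "majority", wtimeout: 5000 } })

      With one of three secondaries locked, two nodes can still acknowledge the write, so a majority write concern succeeds here; the timeout therefore appears specific to mongos server selection rather than the replica set itself.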

            Assignee:
            Misha Tyulenev (misha.tyulenev@mongodb.com) (Inactive)
            Reporter:
            Dmitry Ryabtsev (dmitry.ryabtsev@mongodb.com)
            Votes:
            0
            Watchers:
            17

              Created:
              Updated:
              Resolved: