Core Server / SERVER-22620

Improve mongos handling of a very stale secondary config server


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.1
    • Fix Version/s: 3.3.11
    • Component/s: Replication, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      1. Create a sharded cluster (1 shard is enough)
      2. Make sure you have a replica set of config servers with 3 nodes
      3. Shard a collection
      4. Connect to one of the config server secondaries and lock it with the "db.fsyncLock()" command
      5. Try to shard something else and watch the mongos time out. For example:

        mongos> sh.shardCollection("test.testcol2", {a:1})
        { "ok" : 0, "errmsg" : "Operation timed out", "code" : 50 }
        

        The balancer also fails to work:

        2016-02-16T02:30:35.208+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "dmnx-6-2016-02-16T02:30:35.208+0000-56c289cbaee7ebe12efbfaee", server: "dmnx-6", clientAddr: "", time: new Date(1455589835208), what: "balancer.round", ns: "", details: { executionTimeMillis: 30016, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
        2016-02-16T02:31:15.244+0000 W SHARDING [Balancer] ExceededTimeLimit Operation timed out
        

    • Sprint:
      Sharding 18 (08/05/16), Sharding 2016-08-29
    • Case:

      Description

      This ticket is to improve sharding's handling of very stale secondary config servers (although it would apply to shards as well). The proposed solution is for the isMaster response to include the latest optime the node has replicated, so that the replica set monitor, in addition to selecting 'nearer' hosts, will also prefer those with the most recent optimes.

      This problem is also present in the case of fsyncLocked secondaries: mongos is unable to work properly if one of the config server (RS) secondaries is locked with db.fsyncLock(). Running write concern / read concern operations directly on the replica set while a secondary is locked that way shows no problem, so the issue appears to be in mongos alone.
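      The selection rule proposed above can be sketched as follows. This is a hedged illustration only (hypothetical names and a simplified integer optime; the real logic lives in the server's C++ ReplicaSetMonitor): among hosts within the latency window of the nearest host, prefer the one with the most recently replicated optime, so a "near" but very stale (e.g. fsyncLocked) secondary is not chosen.

        from dataclasses import dataclass

        # Hypothetical model of what the replica set monitor tracks per host.
        # The optime would come from the isMaster response once it reports
        # the latest replicated optime, as this ticket proposes.
        @dataclass
        class HostInfo:
            name: str
            latency_ms: float   # measured round-trip time to the host
            optime: int         # last replicated optime (simplified to an int)

        LATENCY_WINDOW_MS = 15.0  # illustrative "nearest" latency window

        def select_host(hosts):
            """Pick a host: nearest-within-window, then freshest optime."""
            if not hosts:
                return None
            best_latency = min(h.latency_ms for h in hosts)
            # All hosts within the latency window of the nearest are candidates.
            candidates = [h for h in hosts
                          if h.latency_ms <= best_latency + LATENCY_WINDOW_MS]
            # Tie-break by most recent optime, avoiding very stale secondaries.
            return max(candidates, key=lambda h: h.optime)

        hosts = [
            HostInfo("cfg1", latency_ms=1.0, optime=100),   # fsyncLocked, stale
            HostInfo("cfg2", latency_ms=2.0, optime=500),
            HostInfo("cfg3", latency_ms=40.0, optime=510),  # fresh but far away
        ]
        print(select_host(hosts).name)  # cfg2: near and fresh

      With a latency-only rule, cfg1 and cfg2 are equally eligible and the stale cfg1 may be picked; folding in the optime breaks the tie toward the fresh node without abandoning locality.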

        Attachments

          Issue Links

            Activity

              People

              Votes:
              0
              Watchers:
              17

                Dates

                Created:
                Updated:
                Resolved: