Core Server / SERVER-22620

Improve mongos handling of a very stale secondary config server


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.2.1
    • Fix Version/s: 3.3.11
    • Component/s: Replication, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Steps To Reproduce:
      1. Create a sharded cluster (1 shard is enough)
      2. Make sure you have a replica set of config servers with 3 nodes
      3. Shard a collection
      4. Connect to one of the config server secondaries and lock it with the "db.fsyncLock()" command
      5. Try to shard something else and watch the mongos time out. For example:

        mongos> sh.shardCollection("test.testcol2", {a:1})
        { "ok" : 0, "errmsg" : "Operation timed out", "code" : 50 }
        

        The balancer also fails to work:

        2016-02-16T02:30:35.208+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "dmnx-6-2016-02-16T02:30:35.208+0000-56c289cbaee7ebe12efbfaee", server: "dmnx-6", clientAddr: "", time: new Date(1455589835208), what: "balancer.round", ns: "", details: { executionTimeMillis: 30016, errorOccured: false, candidateChunks: 0, chunksMoved: 0 } }
        2016-02-16T02:31:15.244+0000 W SHARDING [Balancer] ExceededTimeLimit Operation timed out
        

    • Sprint:
      Sharding 18 (08/05/16), Sharding 2016-08-29
    • Case:

      Description

      This ticket is to improve sharding's handling of very stale secondary config servers (although it would apply to shards as well). The proposed solution is for the isMaster response to include the latest optime the node has replicated, so that the replica set monitor, in addition to selecting 'nearer' hosts, will also prefer those with the most recent optimes.

      This problem is also present in the case of fsyncLocked secondaries: mongos is unable to work properly if one of the config server (RS) secondaries is locked with db.fsyncLock(). Running write concern / read concern operations directly on the replica set while a secondary is locked that way shows no problem, so the issue appears to be in mongos alone.
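      The selection rule proposed above can be sketched as follows. This is a hedged illustration only (hypothetical names and a simplified integer optime; the real logic lives in the server's C++ ReplicaSetMonitor): among hosts within the latency window of the nearest host, prefer the one with the most recently replicated optime, so a "near" but very stale (e.g. fsyncLocked) secondary is not chosen.

        from dataclasses import dataclass

        # Hypothetical model of what the replica set monitor tracks per host.
        # The optime would come from the isMaster response once it reports
        # the latest replicated optime, as this ticket proposes.
        @dataclass
        class HostInfo:
            name: str
            latency_ms: float   # measured round-trip time to the host
            optime: int         # last replicated optime (simplified to an int)

        LATENCY_WINDOW_MS = 15.0  # illustrative "nearest" latency window

        def select_host(hosts):
            """Pick a host: nearest-within-window, then freshest optime."""
            if not hosts:
                return None
            best_latency = min(h.latency_ms for h in hosts)
            # All hosts within the latency window of the nearest are candidates.
            candidates = [h for h in hosts
                          if h.latency_ms <= best_latency + LATENCY_WINDOW_MS]
            # Tie-break by most recent optime, avoiding very stale secondaries.
            return max(candidates, key=lambda h: h.optime)

        hosts = [
            HostInfo("cfg1", latency_ms=1.0, optime=100),   # fsyncLocked, stale
            HostInfo("cfg2", latency_ms=2.0, optime=500),
            HostInfo("cfg3", latency_ms=40.0, optime=510),  # fresh but far away
        ]
        print(select_host(hosts).name)  # cfg2: near and fresh

      With a latency-only rule, cfg1 and cfg2 are equally eligible and the stale cfg1 may be picked; folding in the optime breaks the tie toward the fresh node without abandoning locality.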

        Attachments

          Issue Links

            Activity

              People

              Votes:
              0
              Watchers:
              17

                Dates

                Created:
                Updated:
                Resolved: