[SERVER-22392] "Operation timed out" when enableSharding on new cluster Created: 01/Feb/16  Updated: 18/Apr/16  Resolved: 05/Mar/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Johannes Ziemke Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod-config-server-logs.tar.gz    
Issue Links:
Related
is related to SERVER-22427 Add additional diagnostics to "Operat... Closed
is related to SERVER-22428 Log read-after-optime timeouts Closed
Operating System: ALL
Steps To Reproduce:

1. Create 6 replica sets with 3 nodes each
2. Set clusterRole to configsvr on one replica set to make it the config server RS
3. Set clusterRole to shardsvr on the others
4. Run rs.initiate() on the 'first' node of each replica set
5. Run db.createUser() to create an admin user (role: "root") on the 'first' node of each replica set
6. Run mongos pointing to the config servers
7. Connect to mongos and add the shards
8. Try to enable sharding: `sh.enableSharding("my-db")`
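
A minimal mongo shell sketch of the shell side of these steps (the clusterRole settings are mongod options and are omitted; the replica set name and hostnames are taken from the sh.status() output below, while the admin username and password are placeholders):

// On the 'first' node of each replica set (shown for replA; repeat analogously for the others):
rs.initiate({
  _id: "replA",
  members: [
    { _id: 0, host: "replA-0.mongodb-shardsvr.service.q:27017" },
    { _id: 1, host: "replA-1.mongodb-shardsvr.service.q:27017" },
    { _id: 2, host: "replA-2.mongodb-shardsvr.service.q:27017" }
  ]
})
// Placeholder credentials -- the actual user name and password are not part of this report:
db.getSiblingDB("admin").createUser({ user: "admin", pwd: "<password>", roles: [ "root" ] })

// Connected to a mongos:
sh.addShard("replA/replA-0.mongodb-shardsvr.service.q:27017")
sh.enableSharding("my-db")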

Sprint: Sharding 10 (02/19/16), Sharding 11 (03/11/16)
Participants:
Linked BF Score: 0

 Description   

Given a MongoDB 3.2.1 cluster with multiple shards, enabling sharding for a database yields "Operation timed out":

mongos>   sh.enableSharding("my-db")
{ "ok" : 0, "errmsg" : "Operation timed out", "code" : 50 }

Adding shards worked just fine, but sh.status() also shows timeout issues:

mongos> sh.status(true)
--- Sharding Status --- 
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("56af2c8c9a517ed41619c719")
}
  shards:
	{  "_id" : "replA",  "host" : "replA/replA-0.mongodb-shardsvr.service.q:27017,replA-1.mongodb-shardsvr.service.q:27017,replA-2.mongodb-shardsvr.service.q:27017" }
	{  "_id" : "replB",  "host" : "replB/replB-0.mongodb-shardsvr.service.q:27017,replB-1.mongodb-shardsvr.service.q:27017,replB-2.mongodb-shardsvr.service.q:27017" }
	{  "_id" : "replC",  "host" : "replC/replC-0.mongodb-shardsvr.service.q:27017,replC-1.mongodb-shardsvr.service.q:27017,replC-2.mongodb-shardsvr.service.q:27017" }
	{  "_id" : "replD",  "host" : "replD/replD-0.mongodb-shardsvr.service.q:27017,replD-1.mongodb-shardsvr.service.q:27017,replD-2.mongodb-shardsvr.service.q:27017" }
	{  "_id" : "replE",  "host" : "replE/replE-0.mongodb-shardsvr.service.q:27017,replE-1.mongodb-shardsvr.service.q:27017,replE-2.mongodb-shardsvr.service.q:27017" }
  active mongoses:
	{  "_id" : "ip-10-1-45-192:27018",  "ping" : ISODate("2016-02-01T11:12:25.013Z"),  "up" : NumberLong(4363),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-45-220:27018",  "ping" : ISODate("2016-02-01T11:12:24.552Z"),  "up" : NumberLong(4363),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-47-28:27018",  "ping" : ISODate("2016-02-01T11:12:24.434Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-40-26:27018",  "ping" : ISODate("2016-02-01T11:12:24.251Z"),  "up" : NumberLong(4363),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-24-168:27018",  "ping" : ISODate("2016-02-01T11:12:24.050Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-29-10:27017",  "ping" : ISODate("2016-02-01T11:12:24.047Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-24-208:27018",  "ping" : ISODate("2016-02-01T11:12:24.044Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-11-238:27017",  "ping" : ISODate("2016-02-01T11:12:23.750Z"),  "up" : NumberLong(4361),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-46-222:27018",  "ping" : ISODate("2016-02-01T11:12:23.686Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-27-15:27018",  "ping" : ISODate("2016-02-01T11:12:23.559Z"),  "up" : NumberLong(4361),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-45-146:27018",  "ping" : ISODate("2016-02-01T11:12:23.529Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-26-190:27018",  "ping" : ISODate("2016-02-01T11:12:23.325Z"),  "up" : NumberLong(4361),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-28-139:27018",  "ping" : ISODate("2016-02-01T11:12:23.253Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-8-127:27018",  "ping" : ISODate("2016-02-01T11:12:23.206Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-25-107:27018",  "ping" : ISODate("2016-02-01T11:12:23.206Z"),  "up" : NumberLong(4362),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-8-239:27018",  "ping" : ISODate("2016-02-01T11:12:22.988Z"),  "up" : NumberLong(4360),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-11-26:27018",  "ping" : ISODate("2016-02-01T11:12:22.986Z"),  "up" : NumberLong(4361),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-10-142:27018",  "ping" : ISODate("2016-02-01T11:12:22.970Z"),  "up" : NumberLong(4361),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-10-41:27018",  "ping" : ISODate("2016-02-01T11:12:22.131Z"),  "up" : NumberLong(4360),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "ip-10-1-11-229:27018",  "ping" : ISODate("2016-02-01T11:12:16.815Z"),  "up" : NumberLong(3640),  "waiting" : false,  "mongoVersion" : "3.2.1" }
  balancer:
	Currently enabled:  yes
	Currently running:  no
	Failed balancer rounds in last 5 attempts:  5
	Last reported error:  could not get updated shard list from config server due to ExceededTimeLimit Operation timed out
	Time of Reported error:  Mon Feb 01 2016 11:12:14 GMT+0000 (UTC)
	Migration Results for the last 24 hours: 
		No recent migrations
  databases:
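
A hedged way to poke at the timeout sh.status() reports is to read the shard list straight from the config metadata on the config server primary. This only approximates the internal read path (the maxTimeMS value below is arbitrary), but a hang or timeout here would point at the config replica set rather than at mongos:

// Run against the config server primary; config.shards holds the shard list
// the balancer says it could not fetch.
db.getSiblingDB("config").runCommand({
  find: "shards",
  readConcern: { level: "majority" },
  maxTimeMS: 30000
})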

On the config server RS, rs.status() looks good:

configReplSet:PRIMARY> rs.status()
{
	"set" : "configReplSet",
	"date" : ISODate("2016-02-01T11:15:52.962Z"),
	"myState" : 1,
	"term" : NumberLong(1),
	"configsvr" : true,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"members" : [
		{
			"_id" : 0,
			"name" : "configReplSet-0.mongodb-configsvr.service.q:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 4588,
			"optime" : {
				"ts" : Timestamp(1454325351, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2016-02-01T11:15:51Z"),
			"electionTime" : Timestamp(1454320777, 1),
			"electionDate" : ISODate("2016-02-01T09:59:37Z"),
			"configVersion" : 1,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "configReplSet-1.mongodb-configsvr.service.q:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 4586,
			"optime" : {
				"ts" : Timestamp(1454325348, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2016-02-01T11:15:48Z"),
			"lastHeartbeat" : ISODate("2016-02-01T11:15:51.718Z"),
			"lastHeartbeatRecv" : ISODate("2016-02-01T11:15:51.271Z"),
			"pingMs" : NumberLong(0),
			"syncingTo" : "configReplSet-0.mongodb-configsvr.service.q:27017",
			"configVersion" : 1
		},
		{
			"_id" : 2,
			"name" : "configReplSet-2.mongodb-configsvr.service.q:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 4586,
			"optime" : {
				"ts" : Timestamp(1454325351, 1),
				"t" : NumberLong(1)
			},
			"optimeDate" : ISODate("2016-02-01T11:15:51Z"),
			"lastHeartbeat" : ISODate("2016-02-01T11:15:51.883Z"),
			"lastHeartbeatRecv" : ISODate("2016-02-01T11:15:51.774Z"),
			"pingMs" : NumberLong(0),
			"syncingTo" : "configReplSet-0.mongodb-configsvr.service.q:27017",
			"configVersion" : 1
		}
	],
	"ok" : 1
}

All of this worked when I did the same on 3.2.0 without authentication, so it's either related to authentication or a regression in 3.2.1. I'll try to reproduce this with 3.2.1 now.



 Comments   
Comment by Kelsey Schubert [ 18/Apr/16 ]

Hi joerg.rech,

Thank you for reporting this issue. Can you please open a new ticket so we can investigate the behavior you are describing?

Kind regards,
Thomas

Comment by Jörg Rech [ 18/Apr/16 ]

I have the same problem with MongoDB 3.2.5 on Microsoft Azure with large collections (> 1 billion documents; > 2 TB compressed): commands such as sh.shardCollection fail with { "code" : 50, "ok" : 0, "errmsg" : "Operation timed out" }, db.count fails as well, and I have problems during balancing (Last reported error: could not get updated shard list from config server due to ExceededTimeLimit Operation timed out). Is there a way to increase the timeout in the MongoDB config?

Comment by Ramon Fernandez Marina [ 05/Mar/16 ]

fish, after your last successful attempt we haven't heard back from you, so I assume you haven't had any further issues. I'm therefore going to close this ticket for the time being, but if the same issue reappears please post back and we can reopen. If you run into some other issues please open a new ticket.

Note also that 3.2.3 is out, and 3.2.4 is scheduled for next week (with a 3.2.4-rc0 release candidate already published), so you may want to consider upgrading since these releases include a number of fixes that your configuration could benefit from.

Regards,
Ramón.

Comment by Johannes Ziemke [ 04/Feb/16 ]

Ah darn... now I've created it again (the 3rd time) and it seems to work. I'll continue with the data import and may run into this or similar issues along the way.

Comment by Johannes Ziemke [ 04/Feb/16 ]

Hi Spencer,

1) Yes, we're using Packer to build MongoDB AMIs and CloudFormation to deploy them.

I'll create a new MongoDB 3.2.1 cluster now and provide the other information soon.

Comment by Spencer Brody (Inactive) [ 01/Feb/16 ]

Hi Johannes,
Thank you for filing this report. We are investigating the issue you describe, but have unfortunately been unable to reproduce it. I'd like to get some more information from you to aid in our diagnostic efforts:

  • You say you are able to reproduce this fairly reliably when setting up a new cluster. When you do so, are you setting up the cluster by hand, or do you have a script or some other machinery for automating this? If you do have a script, would you be able to share it with us so we can see exactly how you're setting up the cluster and attempt the same thing in our reproduction efforts?
  • Can you please attach the results of running "rs.status()" against one of the nodes in the config server replica set? I'm interested in seeing if you have any replication lag on the secondaries.
  • Can you please also attach the results of running "db.serverStatus()" against all 3 config servers as well as the mongos you are talking to when you encounter this problem?
  • If possible, could you attach a dump of the oplog on the config server primary? You should be able to create one by running "mongodump -d local -c oplog.rs" against the config server primary.

If any of this information I've requested is sensitive or contains things you'd rather not have attached to this (public) ticket, feel free to file a new ticket in the Community Private (SUPPORT) project with the information I requested and link it here. Tickets in that project will only be viewable by you and engineers at MongoDB.
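
For reference, a sketch of collecting the requested diagnostics from the mongo shell (how the output is saved or redirected is left to the reader; the oplog dump uses the mongodump command quoted above):

// On one node of the config server replica set:
rs.status()
// On each of the 3 config servers and on the mongos in use when the error occurs:
db.serverStatus()
// The oplog dump is taken from the command line against the config server primary:
//   mongodump -d local -c oplog.rs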

Comment by Johannes Ziemke [ 01/Feb/16 ]

Okay, so I can reliably reproduce this with 3.2.1. I've created the cluster 3-4 times and always run into this issue, although sometimes sh.enableSharding("my-db") still works and it then fails on sh.shardCollection. Sometimes sh.status() fails as well, sometimes it works.

I've also tried to reproduce this with 3.2.0 but can't, so to me it really looks like a regression unless there were some breaking changes between 3.2.0 and 3.2.1.
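
For clarity, the call pattern that fails intermittently looks like this (the collection name and shard key are hypothetical, not taken from the report):

sh.enableSharding("my-db")
sh.shardCollection("my-db.some_collection", { _id: "hashed" })   // hypothetical namespace and key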
