[SERVER-22273] Mongos does not work correctly Created: 22/Jan/16  Updated: 06/Apr/23  Resolved: 01/Feb/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ali Hallaji Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-22427 Add additional diagnostics to "Operat... Closed
Operating System: ALL
Participants:

 Description   

We deployed a sharded cluster across multiple locations (sites/data centers).

Issue: mongos does not return anything, and it shows this message:

[thread1] Error: error: { "$err" : "Operation timed out", "code" : 50 } :

All the details of this sharded cluster are described below:

  1. MongoDB version: 3.2.1
  2. Number of locations (data centers): 4, named thr, ifn, mhd, bnd
  3. The "thr" location is the main data center, so thr must hold the majority of the replica set.
  4. We deployed the config server replica set with 4 members in thr, 4 in mhd, 4 in ifn, and one member in bnd.
  5. Since thr holds the majority, 3 of its members have votes=1 and the rest have votes=0.
  6. The other locations (mhd, ifn) each have just one member with votes=1.
  7. The bnd location has just one secondary, with votes=0.
  8. Each location has one replica set (3 members) for sharding.
  9. We use tag-aware sharding for the replica sets.
  10. We have 4 mongos, 4 shard replica sets (3 members for two of the replica sets, 2 members for the others), and 1 config server replica set (13 members).
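The voting layout described above can be tallied to see the election math: only 5 of the 13 config server members carry a vote (3 in thr, 1 in mhd, 1 in ifn), so 3 votes are required to elect a primary. A minimal sketch, with the per-member vote values copied from the rs.config() output below:

```python
# Sketch: tally the voting members of the 13-member config server replica
# set described above and compute the election majority. Vote values are
# copied from the rs.config() output in this issue.
votes_by_host = {
    "thr-cfg01": 1, "thr-cfg02": 1, "thr-cfg03": 1, "thr-cfg04": 0,
    "mhd-cfg01": 0, "mhd-cfg02": 0, "mhd-cfg03": 1, "mhd-cfg04": 0,
    "ifn-cfg01": 0, "ifn-cfg02": 0, "ifn-cfg03": 1, "ifn-cfg04": 0,
    "bnd-db03": 0,
}

voting_members = sum(votes_by_host.values())   # 5 voting members
majority = voting_members // 2 + 1             # 3 votes needed to elect a primary

print(voting_members, majority)  # 5 3
```

Worth noting with this layout: losing the thr site removes 3 of the 5 votes, leaving only 2, so the config server replica set could no longer elect a primary.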

The output of rs.config() from the config server replica set:

{
	"_id" : "iran_cfg",
	"version" : 6,
	"configsvr" : true,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 1,
			"host" : "thr-cfg01:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 3,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "thr-cfg02:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 2,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 3,
			"host" : "thr-cfg03:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 4,
			"host" : "thr-cfg04:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 5,
			"host" : "mhd-cfg01:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 6,
			"host" : "mhd-cfg02:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 7,
			"host" : "mhd-cfg03:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 8,
			"host" : "mhd-cfg04:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 9,
			"host" : "ifn-cfg01:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 10,
			"host" : "ifn-cfg02:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 11,
			"host" : "ifn-cfg03:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 12,
			"host" : "ifn-cfg04:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		},
		{
			"_id" : 13,
			"host" : "bnd-db03:37017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}

We have 4 mongos, but most of them do not work correctly: when I send a command to a router (sh.status(), sh.add, show collections), the command freezes or crashes.

The config file used by all config servers:

# mongod.conf
 
# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/
 
# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
#  engine:
#  mmapv1:
#  wiredTiger:
 
# where to write logging data.
systemLog:
  destination: file
  verbosity: 5
  traceAllExceptions: true
  logRotate: rename
  logAppend: true
  component:
      sharding:
         verbosity: 5
      command:
         verbosity: 5
      replication:
         verbosity: 5
      network:
         verbosity: 5
 
 
  path: /var/log/mongodb/mongod.log
 
# network interfaces
net:
  port: 37017
 
sharding:
   clusterRole: configsvr
 
replication:
   replSetName: iran_cfg

The config file of the shard servers:

# mongod.conf
 
# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/
 
# Where and how to store data.
storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
#  engine:
#  mmapv1:
#  wiredTiger:
 
# where to write logging data.
systemLog:
  destination: file
  verbosity: 5
  traceAllExceptions: true
  logRotate: rename
  logAppend: true
  component:
      sharding:
         verbosity: 5
      command:
         verbosity: 5
      replication:
         verbosity: 5
      network:
         verbosity: 5
 
  path: /var/log/mongodb/mongod.log
 
# network interfaces
net:
  port: 27018
 
sharding:
   clusterRole: shardsvr
 
# thr,mhd,ifn, ...
replication:
   replSetName: bnd
 
# processManagement:
#    fork: true

The config file of one of the mongos instances:

# mongod.conf
 
# for documentation of all options, see:
#   http://docs.mongodb.org/manual/reference/configuration-options/
 
# Where and how to store data.
#  engine:
#  mmapv1:
#  wiredTiger:
 
# where to write logging data.
systemLog:
  destination: file
  traceAllExceptions: true
  logRotate: rename
  quiet: true
  logAppend: true
  component:
      network:
         verbosity: 5
  path: /var/log/mongodb/mongod.log
 
sharding:
    configDB: "iran_cfg/thr-cfg01:37017,thr-cfg02:37017,mhd-cfg03:37017,ifn-cfg03:37017"
    autoSplit: true
 
net:
    port: 27017
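
The configDB value above uses the config server replica set connection-string format, "<replSetName>/<host:port>,<host:port>,...". A small sketch of how that string splits (the helper name parse_config_db is mine, for illustration only):

```python
# Sketch (helper name is mine): split a CSRS configDB string of the form
# "<replSetName>/<host:port>,..." into the replica set name and seed hosts,
# using the value from the mongos config above.

def parse_config_db(config_db: str):
    rs_name, _, hosts = config_db.partition("/")
    return rs_name, hosts.split(",")

rs_name, seeds = parse_config_db(
    "iran_cfg/thr-cfg01:37017,thr-cfg02:37017,mhd-cfg03:37017,ifn-cfg03:37017"
)
print(rs_name)     # iran_cfg
print(len(seeds))  # 4
```

mongos uses these hosts only as seeds and discovers the remaining members of iran_cfg from them, so the list does not need to contain all 13 members, but at least one seed must be reachable.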

The output of sh.status():

--- Sharding Status --- 
  sharding version: {
	"_id" : 1,
	"minCompatibleVersion" : 5,
	"currentVersion" : 6,
	"clusterId" : ObjectId("56a0f7da64fe06bcccb03350")
}
  shards:
	{  "_id" : "bnd",  "host" : "bnd/bnd-db01:27018,bnd-db02:27018" }
	{  "_id" : "ifn",  "host" : "ifn/ifn-db01:27018,ifn-db02:27018,ifn-db03:27018" }
	{  "_id" : "mhd",  "host" : "mhd/mhd-db01:27018,mhd-db02:27018,mhd-db03:27018" }
	{  "_id" : "thr",  "host" : "thr/thr-db01:27018,thr-db02:27018,thr-db03:27018" }
  active mongoses:
	{  "_id" : "db04-thr-srv:27017",  "ping" : ISODate("2016-01-22T12:29:02.648Z"),  "up" : NumberLong(320),  "waiting" : false,  "mongoVersion" : "3.2.1" }
	{  "_id" : "db04-ifn-srv:27017",  "ping" : ISODate("2016-01-22T12:28:05.566Z"),  "up" : NumberLong(75752),  "waiting" : false,  "mongoVersion" : "3.2.1" }
  balancer:
	Currently enabled:  yes
	Currently running:  no
	Failed balancer rounds in last 5 attempts:  5
	Last reported error:  could not get updated shard list from config server due to ExceededTimeLimit Operation timed out
	Time of Reported error:  Fri Jan 22 2016 15:57:15 GMT+0330 (IRST)
	Migration Results for the last 24 hours: 
		No recent migrations
  databases:

Sorry for the large amount of information in this issue.



 Comments   
Comment by Ramon Fernandez Marina [ 01/Feb/16 ]

Thanks for the update Ali.Hallaji, we'll close this ticket then.

fish, thanks for opening a new ticket.

Comment by Johannes Ziemke [ 01/Feb/16 ]

Ok, I've opened https://jira.mongodb.org/browse/SERVER-22392

Comment by Ali Hallaji [ 31/Jan/16 ]

Hi Ramon Fernandez,
I reinstalled all the mongod instances and reconfigured everything from scratch, and now it works.
But I don't know why; I think they must be initiated together the first time.
Now I have another problem:
"errmsg" : "could not contact primary for replica set thr"

I will create another issue for this problem.
Thank you all.

Best Regards,
Ali Hallaji

Comment by Ramon Fernandez Marina [ 29/Jan/16 ]

ali.hallaji1@gmail.com, the hypothesis we're working on is that one or more of your CSRS nodes may be unreachable (or taking too long to respond), but I tried replicating that on my end and my mongos instances were working correctly.

Can you please upload logs for your CSRS primary and at least one of the affected mongos?

Thanks,
Ramón.

Comment by Kelsey Schubert [ 29/Jan/16 ]

Hi fish,

Thank you for the report. Can you please open a new ticket so we can continue to investigate?

When you open a new ticket please include the following information:

  1. The logs from the mongos that is timing out.
  2. The logs from each node of the config server replica set.

Thank you,
Thomas

Comment by Johannes Ziemke [ 29/Jan/16 ]

Some more general details: it's deployed via CloudFormation with 5 shards (each a replica set) and a config server replica set. Each replica set has 3 nodes, each in a different AWS AZ but all in the same region.

Comment by Johannes Ziemke [ 29/Jan/16 ]

I have the same problem here with 3.2.1 and I am wondering if it might be a regression.
I'm using CloudFormation and static AMIs. I created the current cluster state a few weeks ago, after which I upgraded MongoDB to 3.2.1 without any trouble.
Now I wanted to start over and recreate the cluster from scratch using the 3.2.1 AMI, but ran into the same issue: I can connect to mongos just fine and also add shards, but as soon as I enable sharding for a collection I get:

{ "ok" : 0, "errmsg" : "Operation timed out", "code" : 50 }

Same for sh.status(). In the MongoDB log I see:

2016-01-29_18:01:58.85961 2016-01-29T18:01:58.859+0000 I SHARDING [Balancer] caught exception while doing balance: could not get updated shard list from 
config server due to ExceededTimeLimit Operation timed out

One thing I also noticed is an issue with the localhost auth exception: I tried to create a user, but it failed:

Error: couldn't add user: not authorized on admin to execute command { createUser: "root", pwd: "xxx", roles: [ { role: "root", db: "admin" } ], 
digestPassword: false, writeConcern: { w: "majority", wtimeout: 30000.0 } }

The odd thing: after that, I was still able to auth. So maybe there was already some inconsistency going on. I'll try to see if the same issue still happens if I create the cluster with 3.2.0.

Generated at Thu Feb 08 03:59:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.