[SERVER-5418] High number of commands on replica set members in authenticated sharded setup Created: 27/Mar/12  Updated: 15/Aug/12  Resolved: 09/Apr/12

Status: Closed
Project: Core Server
Component/s: Security, Sharding
Affects Version/s: 2.0.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: László Bácsi Assignee: Randolph Tan
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04, 64bit running on EC2 (replica sets on large instances, config servers and mongos on micro instances)


Issue Links:
Duplicate
duplicates SERVER-5405 mongos does not send reads to seconda... Closed
Operating System: ALL
Participants:

 Description   

We started noticing that our daily jobs, which make heavy use of our MongoDB cluster, were taking much longer to finish. Looking at the mongostat output on our replica sets, we saw that the number of commands was much higher than the number of other operations.

During a typical run of the daily jobs the query count is around 4000 and the command count below 100, while other operation counts vary between a few hundred and a few thousand. This time the query count was 300-400 and the command count 1200-1600.

Our MongoDB cluster is authenticated and we use the same keyFile for all mongo instances.
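For context, keyFile-based internal authentication on the 2.0 series means every mongod and mongos in the cluster is pointed at the same shared key file, roughly like the following (paths, hostnames, and ports here are illustrative, not the reporter's actual configuration):

```shell
# Illustrative only: all replica set members, config servers, and mongos
# routers must share the same key file for intra-cluster authentication.
mongod --replSet rs_a --keyFile /etc/mongodb/keyfile --port 27017
mongos --configdb cfg1:27019,cfg2:27019,cfg3:27019 --keyFile /etc/mongodb/keyfile
```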

I started logging network communication with mongosniff, which reported a high number of 'need to login' errors between the replica set members and the mongos instances. This has happened before, and we could always resolve it by restarting the mongos instances. I tried the same now, but unfortunately it didn't help. I also restarted every other piece of the cluster, but that didn't solve the issue either.

The only difference between previous occasions and this one is that yesterday I upgraded all nodes from 2.0.2 to 2.0.4.

Here's a piece of the mongosniff output:

10.84.214.41:54537  -->> 10.84.214.41:27017 admin.$cmd  66 bytes  id:83b	2107
	query: { replSetGetStatus: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:54537   79 bytes  id:1d74c	120652 - 2107
	reply n:1 cursorId: 0
	{ errmsg: "need to login", ok: 0.0 }
10.84.214.41:54537  -->> 10.84.214.41:27017 admin.$cmd  58 bytes  id:83c	2108
	query: { ismaster: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:54537   324 bytes  id:1d74d	120653 - 2108
	reply n:1 cursorId: 0
	{ setName: "rs_a", ismaster: true, secondary: false, hosts: [ "mongodb-rsa1:27017", "mongodb-rsa2:27017" ], arbiters: [ "mongodb-c1:27018" ], primary: "mongodb-rsa1:27017", me: "mongodb-rsa1:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
10.84.214.41:54537  -->> 10.84.214.41:27017 admin.$cmd  66 bytes  id:83d	2109
	query: { replSetGetStatus: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:54537   79 bytes  id:1d74e	120654 - 2109
	reply n:1 cursorId: 0
	{ errmsg: "need to login", ok: 0.0 }
10.84.214.41:54537  -->> 10.84.214.41:27017 admin.$cmd  58 bytes  id:83e	2110
	query: { ismaster: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:54537   324 bytes  id:1d74f	120655 - 2110
	reply n:1 cursorId: 0
	{ setName: "rs_a", ismaster: true, secondary: false, hosts: [ "mongodb-rsa1:27017", "mongodb-rsa2:27017" ], arbiters: [ "mongodb-c1:27018" ], primary: "mongodb-rsa1:27017", me: "mongodb-rsa1:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
10.84.214.41:54537  -->> 10.84.214.41:27017 admin.$cmd  66 bytes  id:83f	2111
	query: { replSetGetStatus: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:54537   79 bytes  id:1d750	120656 - 2111
	reply n:1 cursorId: 0
	{ errmsg: "need to login", ok: 0.0 }
10.244.73.3:44992  -->> 10.244.73.3:27017 admin.$cmd  58 bytes  id:1e426	123942
	query: { ismaster: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:44992   324 bytes  id:1d751	120657 - 123942
	reply n:1 cursorId: 0
	{ setName: "rs_a", ismaster: true, secondary: false, hosts: [ "mongodb-rsa1:27017", "mongodb-rsa2:27017" ], arbiters: [ "mongodb-c1:27018" ], primary: "mongodb-rsa1:27017", me: "mongodb-rsa1:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
10.244.73.3:44992  -->> 10.244.73.3:27017 admin.$cmd  66 bytes  id:1e427	123943
	query: { replSetGetStatus: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:44992   79 bytes  id:1d752	120658 - 123943
	reply n:1 cursorId: 0
	{ errmsg: "need to login", ok: 0.0 }
10.244.73.3:44992  -->> 10.244.73.3:27017 admin.$cmd  58 bytes  id:1e428	123944
	query: { ismaster: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:44992   324 bytes  id:1d753	120659 - 123944
	reply n:1 cursorId: 0
	{ setName: "rs_a", ismaster: true, secondary: false, hosts: [ "mongodb-rsa1:27017", "mongodb-rsa2:27017" ], arbiters: [ "mongodb-c1:27018" ], primary: "mongodb-rsa1:27017", me: "mongodb-rsa1:27017", maxBsonObjectSize: 16777216, ok: 1.0 }
10.244.73.3:44992  -->> 10.244.73.3:27017 admin.$cmd  66 bytes  id:1e429	123945
	query: { replSetGetStatus: 1 }  ntoreturn: 1 ntoskip: 0
10.118.63.28:27017  <<--  10.118.63.28:44992   79 bytes  id:1d754	120660 - 123945
	reply n:1 cursorId: 0
	{ errmsg: "need to login", ok: 0.0 }



 Comments   
Comment by Randolph Tan [ 09/Apr/12 ]

Reason for the repeated replSetGetStatus calls: if slaveOk is on, ReplicaSetMonitor tries to check whether it is ok to connect to a secondary, and this check is done only once until an error occurs on that connection. However, ReplicaSetMonitor never authenticates the connection it uses when calling replSetGetStatus, so the check on secondary connections always fails, and every thread that needs to talk to a secondary repeats the check every time it wants to talk to a secondary.

Comment by László Bácsi [ 30/Mar/12 ]

Sorry, but unfortunately that's not possible. It looks like the log files were truncated yesterday with the downgrade to 2.0.2.

I could try to reproduce the issue, but that would require going back to 2.0.4, which is only possible in our weekly maintenance window. And I can't promise even that, because we have other priorities for that period.

Comment by Randolph Tan [ 30/Mar/12 ]

Hi,

Is it possible to have the logs for both the slower and the normal runs?

Thanks!

Comment by László Bácsi [ 29/Mar/12 ]

I can confirm that this doesn't affect 2.0.2. Since we couldn't find a solution to this on 2.0.4 I downgraded to 2.0.2 on all nodes. After that, command counts went back to normal.

I thought I'd share since it might help find the root cause of the issue.

Generated at Thu Feb 08 03:08:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.