[SERVER-5418] High number of commands on replica set members in authenticated sharded setup Created: 27/Mar/12 Updated: 15/Aug/12 Resolved: 09/Apr/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Security, Sharding |
| Affects Version/s: | 2.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | László Bácsi | Assignee: | Randolph Tan |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu 10.04, 64-bit, running on EC2 (replica sets on large instances, config servers and mongos on micro instances) |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We started to notice that our daily jobs, which make heavy use of our MongoDB cluster, were taking very long to finish. Looking at the mongostat output on our replica sets, we saw that the number of commands was much higher than the number of other operations. During a typical run of the daily jobs the query count is around 4000 and the command count below 100, while the other operation counts vary between a few hundred and a few thousand. This time the query count was 300-400 and the command count 1200-1600.

Our MongoDB cluster is authenticated and we use the same keyFile for all mongo instances. I logged the network traffic with mongosniff, which reported a high number of 'need to login' errors between the replica set members and the mongos instances. This has happened before, and we could always resolve it by restarting the mongos instances. I tried the same now, but unfortunately it didn't help; restarting all the other pieces of the cluster didn't solve the issue either. The only difference from previous occasions is that yesterday I upgraded all nodes from 2.0.2 to 2.0.4.

Here's a piece of the mongosniff output:
|
| Comments |
| Comment by Randolph Tan [ 09/Apr/12 ] |
|
Reason for the many replSetGetStatus calls: when slaveOk is on, ReplicaSetMonitor checks whether it's ok to connect to a secondary, and this check is supposed to run only once (until an error occurs on that connection). However, ReplicaSetMonitor never authenticates the connection it uses to call replSetGetStatus, so on an authenticated cluster the check on secondary connections always fails, and every thread re-runs the check every time it wants to talk to a secondary. |
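The failure mode described above can be modeled with a small sketch. This is not the actual MongoDB C++ source; the class and method names below are hypothetical stand-ins that illustrate why a check intended to run once instead runs on every slaveOk read when its connection is never authenticated:

```python
# Illustrative model of the bug: the monitor's "ok to use secondary" check
# is meant to succeed once and be cached, but because the connection used
# for replSetGetStatus is never authenticated, the check fails every time
# and the cached flag is never set.

class Connection:
    """Stand-in for a connection to a replica set member."""
    def __init__(self):
        self.authenticated = False  # never set -- models the bug

    def run_command(self, name):
        # replSetGetStatus requires auth on a keyFile-secured cluster
        if not self.authenticated:
            raise PermissionError("need to login")
        return {"ok": 1}

class ReplicaSetMonitorSketch:
    def __init__(self):
        self.conn = Connection()          # unauthenticated connection
        self.secondary_ok_checked = False # should cache a successful check
        self.command_count = 0            # commands seen by the secondary

    def check_secondary(self):
        # Intended behavior: run the check once, then rely on the cache.
        if self.secondary_ok_checked:
            return True
        self.command_count += 1
        try:
            self.conn.run_command("replSetGetStatus")
        except PermissionError:
            # Check failed, so the flag is never set and the next
            # slaveOk read repeats the command.
            return False
        self.secondary_ok_checked = True
        return True

monitor = ReplicaSetMonitorSketch()
for _ in range(1000):   # every slaveOk read triggers the failed check again
    monitor.check_secondary()
print(monitor.command_count)  # 1000 commands instead of 1
```

This matches the symptom in the report: the command count on the replica set members balloons relative to queries, since each read that could go to a secondary issues another replSetGetStatus that fails with 'need to login'.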
| Comment by László Bácsi [ 30/Mar/12 ] |
|
Sorry, but unfortunately it's not possible. It looks like the log files were truncated yesterday during the downgrade to 2.0.2. I could try to reproduce the issue, but that would require going back to 2.0.4, which would only be possible in our weekly maintenance window, and I can't promise that either because we have other priorities for that period. |
| Comment by Randolph Tan [ 30/Mar/12 ] |
|
Hi, Is it possible to have the logs for both the slower and the normal runs? Thanks! |
| Comment by László Bácsi [ 29/Mar/12 ] |
|
I can confirm that this doesn't affect 2.0.2. Since we couldn't find a solution to this on 2.0.4 I downgraded to 2.0.2 on all nodes. After that, command counts went back to normal. I thought I'd share since it might help find the root cause of the issue. |