[SERVER-3830] replSetGetStatus thousands of time per second from mongos Created: 13/Sep/11  Updated: 11/Jul/16  Resolved: 15/Sep/11

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 1.8.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Otto Bergström Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Sharded cluster with three shards, each shard is a replica set consisting of two replicas and one arbiter running Ubuntu Linux. The balancer is turned off.


Attachments: Text File mongos_log.log     Text File printsharding.txt     Text File rs.status.txt    
Operating System: Linux
Participants:

 Description   

When the replica originally defined when adding the shard (with the db.runCommand( { addshard : "replicaset/hostname" } ) command) is the primary, mongos seems to execute a replSetGetStatus for each insert or query it executes.
We are doing several thousand inserts per second, and this behavior causes a severe performance impact. When switching the primary over to a replica discovered by mongos, this behavior stops and performance goes back to normal. The behavior is consistent throughout the cluster on all three shards.

This is an output from mongostat when the primary is the server originally defined when adding the shard in mongos:

insert  query  update  delete  getmore  command
   553    195       0       0      175     1285

This is from mongostat when the other server is primary:

insert  query  update  delete  getmore  command
   607    279       0       0      175      196

Notice the vast difference in command/s.
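As a rough sanity check (a sketch, not part of the original report), the overhead implied by the two mongostat samples can be computed directly:

```javascript
// mongostat samples copied from the report above (operations per second)
const seededPrimary = { insert: 553, query: 195, getmore: 175, command: 1285 };
const otherPrimary  = { insert: 607, query: 279, getmore: 175, command: 196 };

// Extra commands per second observed in the problematic state
const extraCommands = seededPrimary.command - otherPrimary.command; // 1089

// Compared with the insert + query rate, that is on the order of one
// extra command per client operation, consistent with mongos issuing a
// status check per insert/query as described in the report.
const ratio = extraCommands / (seededPrimary.insert + seededPrimary.query);
console.log(extraCommands, ratio.toFixed(2)); // 1089 1.46
```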



 Comments   
Comment by Spencer Brody (Inactive) [ 15/Sep/11 ]

Looks like you're hitting SERVER-3605. This has been fixed in 2.0 and will be backported for 1.8.4. As a temporary work-around for now, restarting mongos should clear things up.

Comment by Otto Bergström [ 15/Sep/11 ]

Sorry, here is the rs.status output:
{
    "set" : "richcollshard3",
    "date" : ISODate("2011-09-15T17:08:53Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "richcolldb05.byburt.com:27017",
            "health" : 0,
            "state" : 1,
            "stateStr" : "(not reachable/healthy)",
            "uptime" : 0,
            "optime" : { "t" : 1316106496000, "i" : 121 },
            "optimeDate" : ISODate("2011-09-15T17:08:16Z"),
            "lastHeartbeat" : ISODate("2011-09-15T17:08:16Z"),
            "errmsg" : "socket exception"
        },
        {
            "_id" : 1,
            "name" : "richcolldb06",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : { "t" : 1316106533000, "i" : 475 },
            "optimeDate" : ISODate("2011-09-15T17:08:53Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "richcollarb03",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 22731,
            "optime" : { "t" : 0, "i" : 0 },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2011-09-15T17:08:52Z")
        }
    ],
    "ok" : 1
}
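When diagnosing output like the above, the members array can also be scanned programmatically for unhealthy nodes; a minimal sketch (using an abridged copy of the attached document):

```javascript
// Abridged rs.status()-style document, copied from the output above
const status = {
  set: "richcollshard3",
  members: [
    { _id: 0, name: "richcolldb05.byburt.com:27017", health: 0, stateStr: "(not reachable/healthy)" },
    { _id: 1, name: "richcolldb06", health: 1, stateStr: "PRIMARY" },
    { _id: 2, name: "richcollarb03", health: 1, stateStr: "ARBITER" },
  ],
};

// A member with health !== 1 is not reachable/healthy from this node's
// point of view; here that flags the troublesome richcolldb05 member.
const unhealthy = status.members.filter(m => m.health !== 1);
console.log(unhealthy.map(m => `${m.name}: ${m.stateStr}`));
// → [ 'richcolldb05.byburt.com:27017: (not reachable/healthy)' ]
```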

Comment by Spencer Brody (Inactive) [ 15/Sep/11 ]

I don't see any calls to replSetGetStatus in the mongos logs - where were you seeing this?

Can you also attach the mongod logs from the primary and secondary where you see the problem (if you have them)?

Comment by Spencer Brody (Inactive) [ 15/Sep/11 ]

both rs.status.txt and printsharding.txt contain the same output from printShardingStatus(). Do you have the rs.status() output?

Comment by Otto Bergström [ 15/Sep/11 ]

I have now attached the information requested. Sorry for taking so long, but I have to wait until the problem appears and do not want to force it, since we are in production.

Comment by Otto Bergström [ 15/Sep/11 ]

Attached are the outputs requested. The troublesome node is richcolldb06.

Comment by Spencer Brody (Inactive) [ 13/Sep/11 ]

Can you please attach:
The output of db.printShardingStatus() on the mongos
The output of rs.status() from the primary on a shard where you're seeing the error
The mongos logs from when you see this problem.

Generated at Thu Feb 08 03:04:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.