[SERVER-4118] mongos causes DoS by opening a ton of connections Created: 21/Oct/11  Updated: 11/Jul/16  Resolved: 08/Dec/11

Status: Closed
Project: Core Server
Component/s: Performance, Sharding, Stability
Affects Version/s: 2.0.0-rc2
Fix Version/s: 2.0.2

Type: Bug Priority: Major - P3
Reporter: Y. Wayne Huang Assignee: Greg Studer
Resolution: Done Votes: 2
Labels: mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04 x86_64; 4 shards, each shard an RS of two members + arbiter


Attachments: File mongodb.scrubbed.log.gz    
Issue Links:
Depends
depends on SERVER-4240 make 2.0.1 mongos nightly compatible ... Closed
Related
related to SERVER-3683 Possible for setShardVersion to never... Closed
Operating System: Linux
Participants:

 Description   

Comment for future reference: this ticket was used primarily to track a writeback listener change in 2.0.1. The writeback listener (wbl) was not forcing a version reload, which would have reset the connection version on the mongod side, even after multiple failed retries. The changes in this ticket fix that.
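
(Illustrative pseudocode only, not the actual mongos C++; the helper names below are hypothetical. It sketches the behavior change described above and in the commit "force reload of config after two failed wbl checks".)

// Hypothetical sketch of the writeback-listener retry loop after the fix:
// instead of retrying forever with a stale shard version, force a config
// reload after two failed checks so the connection version is reset on mongod.
var MAX_FAILED_WBL_CHECKS = 2;

function handleWriteback(wb) {
    var failures = 0;
    while (!retryWriteback(wb)) {             // hypothetical helper: re-send the written-back op
        failures += 1;
        if (failures >= MAX_FAILED_WBL_CHECKS) {
            forceReloadShardConfig(wb.ns);    // hypothetical helper: reload shard/chunk config,
            failures = 0;                     // resetting the connection's shard version
        }
    }
}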

This issue may be similar or related to SERVER-4037. Previously, we have noted that mongos will enter a state in which it attempts to retry a command query indefinitely against some shard members. Prior to 2.0.0-rc2, this resulted in mongos outputting a stream of 'writeback failed' messages. That no longer occurs, but we still see behavior that causes the number of command ops/sec on the shards to climb into the 1500-2000 ops/sec range. Note that these command ops are not logged at the shards, but we definitely see them when running mongostat. Bouncing mongos fixes the issue.
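
For reference, a rough shell-level way to watch the same counter (an approximation of mongostat's "command" column, run against a shard primary; sleep() is the mongo shell's built-in millisecond sleep):

// Print the per-second rate of the "command" opcounter, which is where the
// storm shows up even though the individual commands are not logged.
var prev = db.serverStatus().opcounters.command;
while (true) {
    sleep(1000);
    var cur = db.serverStatus().opcounters.command;
    print(new Date() + "  " + (cur - prev) + " commands/sec");
    prev = cur;
}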

On the shards, we can see that mongos has effectively caused a DoS by opening many connections in quick succession:

Fri Oct 21 11:50:49 [initandlisten] connection accepted from mongos:35539 #9663
Fri Oct 21 11:50:50 [initandlisten] connection accepted from mongos:35543 #9664
Fri Oct 21 11:50:51 [initandlisten] connection accepted from mongos:35544 #9665
Fri Oct 21 11:50:51 [initandlisten] connection accepted from mongos:35545 #9666
Fri Oct 21 11:50:53 [initandlisten] connection accepted from mongos:35546 #9667
Fri Oct 21 11:51:01 [initandlisten] connection accepted from mongos:35547 #9668
Fri Oct 21 11:51:02 [initandlisten] connection accepted from mongos:35551 #9669
Fri Oct 21 11:51:02 [initandlisten] connection accepted from mongos:35552 #9670
Fri Oct 21 11:51:03 [initandlisten] connection accepted from mongos:35553 #9671
Fri Oct 21 11:51:03 [initandlisten] connection accepted from mongos:35554 #9672
Fri Oct 21 11:51:05 [initandlisten] connection accepted from mongos:35558 #9673
Fri Oct 21 11:51:12 [initandlisten] connection accepted from mongos:35559 #9674
Fri Oct 21 11:51:13 [initandlisten] connection accepted from mongos:35560 #9675
Fri Oct 21 11:51:13 [initandlisten] connection accepted from mongos:35561 #9676
Fri Oct 21 11:51:14 [initandlisten] connection accepted from mongos:35565 #9677
Fri Oct 21 11:51:14 [initandlisten] connection accepted from mongos:35569 #9678
Fri Oct 21 11:51:16 [initandlisten] connection accepted from mongos:35570 #9679
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35571 #9680
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35572 #9681
Fri Oct 21 11:51:18 [initandlisten] connection accepted from mongos:35573 #9682
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35577 #9683
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35580 #9684
Fri Oct 21 11:51:19 [initandlisten] connection accepted from mongos:35581 #9685
Fri Oct 21 11:51:21 [initandlisten] connection accepted from mongos:35585 #9686
Fri Oct 21 11:51:22 [initandlisten] connection accepted from mongos:35586 #9687
Fri Oct 21 11:51:22 [initandlisten] connection accepted from mongos:35587 #9688
Fri Oct 21 11:51:29 [initandlisten] connection accepted from mongos:35591 #9689
Fri Oct 21 11:51:30 [initandlisten] connection accepted from mongos:35592 #9690
Fri Oct 21 11:51:31 [initandlisten] connection accepted from mongos:35593 #9691
Fri Oct 21 11:51:32 [initandlisten] connection accepted from mongos:35594 #9692
Fri Oct 21 11:51:34 [initandlisten] connection accepted from mongos:35595 #9693
Fri Oct 21 11:51:37 [initandlisten] connection accepted from mongos:35596 #9694
Fri Oct 21 11:51:38 [initandlisten] connection accepted from mongos:35597 #9695
Fri Oct 21 11:51:38 [initandlisten] connection accepted from mongos:35598 #9696
Fri Oct 21 11:51:40 [initandlisten] connection accepted from mongos:35599 #9697
Fri Oct 21 11:51:40 [initandlisten] connection accepted from mongos:35600 #9698
Fri Oct 21 11:51:41 [initandlisten] connection accepted from mongos:35601 #9699
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35602 #9700
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35603 #9701
Fri Oct 21 11:51:42 [initandlisten] connection accepted from mongos:35604 #9702
Fri Oct 21 11:51:43 [initandlisten] connection accepted from mongos:35605 #9703
Fri Oct 21 11:51:44 [initandlisten] connection accepted from mongos:35609 #9704
...

The load on this shard RS member increased to 15 and the member became unresponsive. Bouncing mongos stopped the connections/queries and service returned to normal. During the initial stages of the problem, performance degraded severely, eventually leading to all queries timing out.



 Comments   
Comment by Greg Studer [ 08/Dec/11 ]

No problem - definitely let us know if you see this or any other issue again. Closing for now; feel free to reopen if it recurs.

Comment by Y. Wayne Huang [ 08/Dec/11 ]

We're running 2.0.2-rc1 for mongos and 2.0.1 for mongod, and so far so good. Thanks for the fix.

Comment by Greg Studer [ 10/Nov/11 ]

Should be within a week for rc0. Just pushed some backwards-compatibility changes as well.

Comment by Greg Studer [ 09/Nov/11 ]

Not exactly sure - we have a scheduling meeting tomorrow, so I'll update then. We also need to do more testing of the final branch once we freeze it.

Comment by Y. Wayne Huang [ 09/Nov/11 ]

When can we expect 2.0.2?

Comment by Greg Studer [ 09/Nov/11 ]

> Wed Nov 9 12:24:42 [conn7] warning: adding shard sub-connection shard1/a:27011,b:27012 (parent shard1/a:27011,b:27012) as sharded, this is safe but unexpected
This is safe - it's just overly verbose connection-tracking messaging, and it has been removed in 2.0.2. It should only occur more than once at high verbosity levels.

> could not initialize cursor across all shards because : could not initialize sharding on connection shard3/x:27017,y:27017 :: caused by :: need to specify namespace
Apologies for this - the new connection initialization in the new nightly requires a new mongod as well, as the protocol has changed slightly. Will make sure this is more seamless for the release.

Comment by Y. Wayne Huang [ 09/Nov/11 ]

Also seeing this in the mongos log:

Wed Nov 9 12:24:42 [conn7] warning: adding shard sub-connection shard1/a:27011,b:27012 (parent shard1/a:27011,b:27012) as sharded, this is safe but unexpected
0x5c5fe1 0x5c37d5 0x5c3c5c 0x7967ae 0x796e5e 0x79088f 0x76a6bb 0x7b1627 0x7c37c1 0x5e9557 0x7fe99da279ca 0x7fe99cfd670d
/usr/bin/mongos(_ZN5mongo17ClientConnections3getERKSsS2_b+0x571) [0x5c5fe1]
/usr/bin/mongos(_ZN5mongo15ShardConnection5_initEb+0x65) [0x5c37d5]
/usr/bin/mongos(_ZN5mongo15ShardConnectionC1ERKNS_5ShardERKSsb+0x8c) [0x5c3c5c]
/usr/bin/mongos(_ZN5mongo15dbgrid_pub_cmds17PublicGridCommand12_passthroughERKSsN5boost10shared_ptrINS_8DBConfigEEERKNS_7BSONObjEiRNS_14BSONObjBuilderE+0x15e) [0x7967ae]
/usr/bin/mongos(_ZN5mongo15dbgrid_pub_cmds32NotAllowedOnShardedCollectionCmd3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x13e) [0x796e5e]
/usr/bin/mongos(_ZN5mongo7Command20runAgainstRegisteredEPKcRNS_7BSONObjERNS_14BSONObjBuilderEi+0x89f) [0x79088f]
/usr/bin/mongos(_ZN5mongo14SingleStrategy7queryOpERNS_7RequestE+0x5cb) [0x76a6bb]
/usr/bin/mongos(_ZN5mongo7Request7processEi+0x187) [0x7b1627]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x7c37c1]
/usr/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x5e9557]
/lib/libpthread.so.0(+0x69ca) [0x7fe99da279ca]
/lib/libc.so.6(clone+0x6d) [0x7fe99cfd670d]

Comment by Y. Wayne Huang [ 09/Nov/11 ]

Not sure if it's another fix in HEAD, but the 2.0.1 nightly build for 11/9 caused our entire site to go down. All queries returned the following error:
could not initialize cursor across all shards because : could not initialize sharding on connection shard3/x:27017,y:27017 :: caused by :: need to specify namespace

We've rolled back to 2.0.1 stable.

Comment by Greg Studer [ 09/Nov/11 ]

Just mongos.

Comment by Y. Wayne Huang [ 08/Nov/11 ]

Should we update just mongos, or do we need to update mongod as well?

Comment by Greg Studer [ 08/Nov/11 ]

The writeback listener fix has been backported and is in the 2.0.1 nightly now (staged for 2.0.2), if you'd like to try it.

Comment by Y. Wayne Huang [ 08/Nov/11 ]

Any updates?

Comment by Greg Studer [ 02/Nov/11 ]

Will post once a build with this patch is available.

Comment by auto [ 01/Nov/11 ]

Author: gregstuder (gregs, greg@10gen.com)
Message: force reload of config after two failed wbl checks SERVER-4118
Branch: v2.0
https://github.com/mongodb/mongo/commit/bcf4ca44072d45f4e8d87a3341350596ef54bff2

Comment by auto [ 01/Nov/11 ]

Author: gregstuder (gregs, greg@10gen.com)
Message: force reload of config after two failed wbl checks SERVER-4118
Branch: master
https://github.com/mongodb/mongo/commit/c96569dc11ed389c7ea279d52b65753aa9946b18

Comment by Greg Studer [ 01/Nov/11 ]

From the logs, this now seems like a related but different issue - shard version isn't getting force-reloaded after multiple retries.

Comment by Y. Wayne Huang [ 01/Nov/11 ]

Any updates? We see this behavior affect our cluster about 2-3 times per week, and a Google search suggests others are seeing it as well.

Comment by Y. Wayne Huang [ 28/Oct/11 ]

Log file for today is attached.

Comment by Y. Wayne Huang [ 28/Oct/11 ]

{
    "host" : "mongos-1",
    "version" : "2.0.1",
    "process" : "mongos",
    "uptime" : 19326,
    "localTime" : ISODate("2011-10-28T06:11:30.833Z"),
    "mem" : { "resident" : 14, "virtual" : 245, "supported" : true },
    "connections" : { "current" : 9, "available" : 15991 },
    "extra_info" : { "note" : "fields vary by platform", "heap_usage_bytes" : 952000, "page_faults" : 0 },
    "opcounters" : { "insert" : 123760, "query" : 49708, "update" : 149348, "delete" : 4, "getmore" : 5575, "command" : 109302 },
    "ops" : {
        "sharded" : { "insert" : 52682, "query" : 9022, "update" : 86423, "delete" : 4, "getmore" : 5360, "command" : 0 },
        "notSharded" : { "insert" : 71078, "query" : 40686, "update" : 62925, "delete" : 0, "getmore" : 215, "command" : 109302 }
    },
    "shardCursorType" : { "ParallelSort" : 9022 },
    "asserts" : { "regular" : 0, "warning" : 0, "msg" : 0, "user" : 721, "rollovers" : 0 },
    "network" : { "bytesIn" : 403446802, "bytesOut" : 3332333902, "numRequests" : 341639 },
    "ok" : 1
}

Comment by Eliot Horowitz (Inactive) [ 28/Oct/11 ]

Can you send db.serverStatus() and the full mongos log?

Comment by Y. Wayne Huang [ 28/Oct/11 ]

We've since upgraded to 2.0.1 (from your repo) and still see the issue.

Comment by Eliot Horowitz (Inactive) [ 28/Oct/11 ]

Are you still on 2.0.0-rc2?
There was a change for 2.0.1 that might be an issue.

Comment by Y. Wayne Huang [ 28/Oct/11 ]

Any updates?

Comment by Y. Wayne Huang [ 25/Oct/11 ]

That mongostat output was hard to read... this might be better:
https://skitch.com/wayne530/gdce6/mongostat

Comment by Y. Wayne Huang [ 25/Oct/11 ]

Greg, we are seeing the same issue right now. mongos appears to be retrying some type of query indefinitely, resulting in a 'shard version not ok in Client::Context' assertion each time:

Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1886] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3
Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3

Here is mongostat output from the affected shard:

insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
4 0 0 0 4 2562 0 132g 265g 3.12g 1 15.9 0 0|0 1|1 431k 569k 87 shard2 M 09:44:26
0 0 0 0 0 2341 0 132g 265g 3.12g 0 12.5 0 0|0 1|1 393k 516k 87 shard2 M 09:44:27
1 0 0 0 1 2546 0 132g 265g 3.12g 4 18.3 0 0|0 1|1 429k 562k 87 shard2 M 09:44:28
1 0 0 0 1 2620 0 132g 265g 3.12g 0 14.7 0 0|0 1|1 442k 582k 87 shard2 M 09:44:29
1 0 0 0 1 2562 0 132g 265g 3.12g 0 14.4 0 0|0 1|1 433k 570k 87 shard2 M 09:44:30
7 0 0 0 7 2490 0 132g 265g 3.12g 1 17.1 0 0|0 1|1 421k 556k 87 shard2 M 09:44:31
0 0 0 0 0 2531 0 132g 265g 3.12g 0 14.1 0 0|0 2|1 425k 557k 87 shard2 M 09:44:32
8 0 0 0 4 2592 0 132g 265g 3.12g 0 15.6 0 0|0 1|1 436k 575k 87 shard2 M 09:44:33
0 0 0 0 0 2374 0 132g 265g 3.12g 0 14 0 0|0 1|1 397k 521k 87 shard2 M 09:44:34
1 0 0 0 1 2456 0 132g 265g 3.12g 0 14.4 0 0|0 2|1 414k 543k 87 shard2 M 09:44:35
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 0 0 0 0 2585 0 132g 265g 3.11g 0 14.9 0 0|0 2|1 434k 574k 87 shard2 M 09:44:36
1 0 0 0 1 2194 0 132g 265g 3.11g 0 12.3 0 0|0 1|1 368k 483k 87 shard2 M 09:44:37
2 0 0 0 2 2545 0 132g 265g 3.11g 0 14.4 0 0|0 1|1 436k 575k 87 shard2 M 09:44:38
0 0 0 0 0 2514 0 132g 265g 3.11g 0 13.4 0 3|1 3|1 419k 551k 87 shard2 M 09:44:39
1 0 0 0 1 2657 0 132g 265g 3.11g 0 14.2 0 0|0 1|1 446k 585k 87 shard2 M 09:44:40
3 0 0 0 3 2558 0 132g 265g 3.11g 0 16.6 0 0|0 1|1 432k 571k 87 shard2 M 09:44:41
0 0 0 0 0 2444 0 132g 265g 3.11g 0 14 0 0|0 1|1 410k 538k 87 shard2 M 09:44:42
10 0 0 0 8 2684 0 132g 265g 3.11g 0 13.8 0 0|0 1|1 454k 599k 87 shard2 M 09:44:43
4 0 0 0 4 2546 0 132g 265g 3.11g 0 14 0 0|0 1|1 429k 562k 87 shard2 M 09:44:44
1 0 0 0 1 2464 0 132g 265g 3.11g 0 13.4 0 0|0 1|1 415k 545k 87 shard2 M 09:44:45
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
1 0 0 0 1 2483 0 132g 265g 3.11g 1 13.9 0 0|0 1|1 420k 555k 87 shard2 M 09:44:46
1 0 0 0 1 2447 0 132g 265g 3.11g 0 12.8 0 0|0 1|1 411k 540k 87 shard2 M 09:44:47
0 0 0 0 0 2404 0 132g 265g 3.11g 0 12.2 0 0|0 2|1 402k 527k 87 shard2 M 09:44:48
0 0 0 0 0 115 0 132g 265g 3.11g 0 0.5 0 0|0 1|1 18k 28k 57 shard2 M 09:44:49
0 0 0 0 1 2 0 132g 265g 3.11g 0 0 0 0|0 1|1 241b 1k 57 shard2 M 09:44:50
0 0 0 0 0 8 0 132g 265g 3.11g 0 0 0 0|0 1|1 566b 4k 57 shard2 M 09:44:51
0 0 0 0 1 2 0 132g 265g 3.11g 0 0 0 0|0 1|1 241b 1k 57 shard2 M 09:44:52
0 0 0 0 0 14 0 132g 265g 3.11g 0 0 0 0|0 1|1 938b 7k 57 shard2 M 09:44:53
6 0 0 0 3 6 0 132g 265g 3.1g 0 0.2 0 0|0 1|1 2k 4k 45 shard2 M 09:44:54
0 0 0 0 0 8 0 132g 265g 3.1g 0 0 0 0|0 1|1 726b 2k 48 shard2 M 09:44:55
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time
0 0 0 0 1 16 0 132g 265g 3.1g 0 0 0 0|0 1|1 1k 7k 49 shard2 M 09:44:56
1 0 0 0 1 5 0 132g 265g 3.1g 0 0 0 0|0 1|1 1k 2k 49 shard2 M 09:44:57
1 0 0 0 1 4 0 132g 265g 3.1g 32 0 0 0|0 2|1 894b 2k 49 shard2 M 09:44:58

I'm sure you can guess when I bounced mongos.

https://skitch.com/wayne530/gdc89/mongodb-ops

I don't think this is some kind of transient system issue, as it's pretty clear that mongos is retrying some type of command query repeatedly.

Comment by Greg Studer [ 24/Oct/11 ]

As a workaround, you can set the maxConns parameter in mongos >= 1.8.3 to limit the maximum number of connections it will open per host; the default behavior is to try to service requests no matter how fast they come in.

Also, if you increase the verbosity of mongos and mongod to logLevel 2, do you see the commands being (re)sent?

This may be a systemic issue, where a small backup on the primary causes new connections and larger backups on the primary, and so on...
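
For example (a sketch only; option names as of the 2.0-era binaries, hostnames and values are placeholders):

// From the command line, cap connections and raise verbosity (-vv is log level 2):
//   mongos --configdb cfg1,cfg2,cfg3 --maxConns 500 -vv
//   mongod --replSet shard2 --maxConns 500 -vv
// Or, on an already-running process, from the mongo shell (assuming setParameter
// accepts logLevel on this build):
db.adminCommand({ setParameter: 1, logLevel: 2 })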

Comment by Y. Wayne Huang [ 21/Oct/11 ]

shard2 command ops graph:
https://skitch.com/wayne530/gnwc4/shard2

Unfortunately, during this process, mongos (v2.0.1-pre- git:a7a36ecbf060624a1bfcf36e0a00de02b907c019) produces no log output. The affected shard primaries also do not log the command queries being performed; however, we definitely see them show up in the counts when using mongostat.

Comment by Y. Wayne Huang [ 21/Oct/11 ]

Upon further investigation, the log messages above may be unrelated. Here's what we see:

Command ops/sec on some shard primaries will increase to 1500-2000 ops/sec, during which performance is degraded. Bouncing mongos fixes the issue. It appears that asking an affected primary to step down stops the activity momentarily, but eventually the problem also affects the new primary. Bouncing mongos seems to be the only sure way to stop the issue for an extended period of time.

We believe the above issue is different. There is an rsSync process running that holds a global write lock; to the best of my knowledge, it is supposed to yield that lock as necessary. What we saw is that the process stopped yielding the lock, causing commands to pile up behind it. The connections from mongos, normally short-lived, now persist because the commands they are routing block on the apparently non-yielding global write lock held by rsSync. The connections shown above are within range for our typical workload (~10-20 ops/sec). When viewing db.currentOp() during the pile-up, the only operation holding a write lock was this rsSync process (a small shell snippet for spotting such lock holders follows the output):

{
    "opid" : "shard1:23606014",
    "active" : true,
    "lockType" : "write",
    "waitingForLock" : false,
    "secs_running" : 5497,
    "op" : "none",
    "ns" : "mydb.mycoll",
    "query" : { },
    "client_s" : "",
    "desc" : "rsSync",
    "threadId" : "0x7fec5b5f8700",
    "numYields" : 1
},
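
For reference, a quick way to pull out whatever is holding a write lock at a moment like this (mongo shell; field names match the 2.0-era output above):

// List in-progress operations that currently hold a write lock.
db.currentOp().inprog.forEach(function(op) {
    if (op.lockType === "write") {
        print(op.opid + "  " + op.desc + "  " + op.ns + "  running " + op.secs_running + "s");
    }
});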
