[SERVER-4118] mongos causes dos by opening a ton of connections Created: 21/Oct/11 Updated: 11/Jul/16 Resolved: 08/Dec/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Sharding, Stability |
| Affects Version/s: | 2.0.0-rc2 |
| Fix Version/s: | 2.0.2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Y. Wayne Huang | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 2 |
| Labels: | mongos | | |
| Remaining Estimate: | Not Specified | | |
| Time Spent: | Not Specified | | |
| Original Estimate: | Not Specified | | |
| Environment: |
ubuntu 10.04 x86_64; 4 shards, each shard a RS of two members + arbiter |
| Attachments: |
|
| Issue Links: |
|
| Operating System: | Linux |
| Participants: | |
| Description |
|
Comment for future reference - this ticket was used primarily to track a writeback listener change in 2.0.1: the wbl was not forcing a version reload (which would reset the connection version on the mongod side), even after multiple failed retries. The changes in this ticket fix that. This issue may be similar to or related to the linked issue(s).

On the shards, we can see that mongos has effectively caused a DoS by issuing many quick successive connections:

Fri Oct 21 11:50:49 [initandlisten] connection accepted from mongos:35539 #9663

The load on this shard RS member increased to 15 and the member became unresponsive. Bouncing mongos stopped the connections/queries and service returned to normal. During the initial stages of the problem, performance degraded severely, eventually leading to all queries timing out. |
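For readers hitting a similar connection storm, a minimal monitoring sketch (not part of the original report) can make the symptom easy to spot from the mongo shell. It assumes only the standard `db.serverStatus().connections` document and the shell's built-in `sleep()`; the 500-connection threshold is an arbitrary example value.

```javascript
// Minimal sketch: poll serverStatus() on a shard primary and flag a sudden
// jump in incoming connections like the mongos storm described above.
// The threshold (500) is an arbitrary example, not a value from the ticket.
var prev = db.serverStatus().connections.current;
while (true) {
    sleep(5000);  // mongo shell built-in; milliseconds
    var cur = db.serverStatus().connections.current;
    if (cur - prev > 500) {
        print(new Date() + " connection spike: " + prev + " -> " + cur);
    }
    prev = cur;
}
```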
| Comments |
| Comment by Greg Studer [ 08/Dec/11 ] |
|
np - definitely let us know if you see this again or any other issue. Closing for now - feel free to reopen if you see this issue again. |
| Comment by Y. Wayne Huang [ 08/Dec/11 ] |
|
we're running 2.0.2 rc1 for mongos and 2.0.1 for mongod and so far so good. thanks for the fix |
| Comment by Greg Studer [ 10/Nov/11 ] |
|
Should be within a week for rc0. Just pushed some backwards-compatibility changes as well. |
| Comment by Greg Studer [ 09/Nov/11 ] |
|
Not exactly sure, having a scheduling meeting tomorrow, so will update. We need to do more testing of the final branch once we freeze it too. |
| Comment by Y. Wayne Huang [ 09/Nov/11 ] |
|
when can we expect 2.0.2? |
| Comment by Greg Studer [ 09/Nov/11 ] |
|
> Wed Nov 9 12:24:42 [conn7] warning: adding shard sub-connection shard1/a:27011,b:27012 (parent shard1/a:27011,b:27012) as sharded, this is safe but unexpected

> could not initialize cursor across all shards because : could not initialize sharding on connection shard3/x:27017,y:27017 :: caused by :: need to specify namespace |
| Comment by Y. Wayne Huang [ 09/Nov/11 ] |
|
also see this in the mongos log:

Wed Nov 9 12:24:42 [conn7] warning: adding shard sub-connection shard1/a:27011,b:27012 (parent shard1/a:27011,b:27012) as sharded, this is safe but unexpected |
| Comment by Y. Wayne Huang [ 09/Nov/11 ] |
|
not sure if it's another fix in HEAD, but the 2.0.1 nightly build for 11/9 caused our entire site to go down: all queries returned the error quoted above ("could not initialize cursor across all shards ... need to specify namespace"). we've rolled back to 2.0.1 stable. |
| Comment by Greg Studer [ 09/Nov/11 ] |
|
Just mongos. |
| Comment by Y. Wayne Huang [ 08/Nov/11 ] |
|
should we update just mongos or do we need to update mongod as well? |
| Comment by Greg Studer [ 08/Nov/11 ] |
|
The writebacklistener fix has been backported, in the 2.0.1 nightly now (staged for 2.0.2), if you'd like to try it. |
| Comment by Y. Wayne Huang [ 08/Nov/11 ] |
|
any updates? |
| Comment by Greg Studer [ 02/Nov/11 ] |
|
will post once a build with this patch is available. |
| Comment by auto [ 01/Nov/11 ] |
|
Author: gregs (gregstuder) <greg@10gen.com>
Message: force reload of config after two failed wbl checks |
| Comment by auto [ 01/Nov/11 ] |
|
Author: gregs (gregstuder) <greg@10gen.com>
Message: force reload of config after two failed wbl checks |
| Comment by auto [ 01/Nov/11 ] |
|
Author: gregs (gregstuder) <greg@10gen.com>
Message: force reload of config after two failed wbl checks |
| Comment by auto [ 01/Nov/11 ] |
|
Author: gregs (gregstuder) <greg@10gen.com>
Message: force reload of config after two failed wbl checks |
| Comment by Greg Studer [ 01/Nov/11 ] |
|
From the logs, this now seems like a related but different issue - shard version isn't getting force-reloaded after multiple retries. |
| Comment by Y. Wayne Huang [ 01/Nov/11 ] |
|
any updates? we see this behavior affect our cluster about 2-3 times per week and a google search seems to reveal others are seeing it as well |
| Comment by Y. Wayne Huang [ 28/Oct/11 ] |
|
log file for today attached |
| Comment by Y. Wayne Huang [ 28/Oct/11 ] |
|
(db.serverStatus() output; not preserved in this export) |
| Comment by Eliot Horowitz (Inactive) [ 28/Oct/11 ] |
|
Can you send db.serverStatus() and the full mongos log? |
| Comment by Y. Wayne Huang [ 28/Oct/11 ] |
|
we've since upgraded to 2.0.1 (from your repo) and still see the issue. |
| Comment by Eliot Horowitz (Inactive) [ 28/Oct/11 ] |
|
Are you still on 2.0.0-rc2? |
| Comment by Y. Wayne Huang [ 28/Oct/11 ] |
|
any updates? |
| Comment by Y. Wayne Huang [ 25/Oct/11 ] |
|
that mongostat was hard to read.. this might be better: |
| Comment by Y. Wayne Huang [ 25/Oct/11 ] |
|
Greg, we are seeing the same issue right now. mongos appears to be retrying some type of query infinitely, resulting in a 'shard version not ok in Client::Context' error each time:

Tue Oct 25 09:39:50 [conn1880] Assertion: 13388:[mydb.mycoll] shard version not ok in Client::Context: client in sharded mode, but doesn't have version set for this collection: mydb.mycoll myVersion: 155|3

here is mongostat from the affected shard (data rows not preserved in this export):

insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn set repl time

i'm sure you can guess when i bounced mongos: https://skitch.com/wayne530/gdc89/mongodb-ops

i don't think this is some kind of transient system issue, as it's pretty clear that mongos is retrying some type of command query repeatedly. |
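One way to dig into a "shard version not ok" assertion like the one above is to compare the version mongos has cached for the namespace with what the shard reports. The sketch below is illustrative rather than something suggested in the ticket; it assumes the long-standing `getShardVersion` and `flushRouterConfig` admin commands and reuses the `mydb.mycoll` namespace from the log line.

```javascript
// Run against mongos: report the shard version mongos has cached for the
// namespace from the assertion above.
db.adminCommand({ getShardVersion: "mydb.mycoll" });

// If stale routing metadata is suspected, mongos can be told to reload it
// (mongos-only command):
db.adminCommand({ flushRouterConfig: 1 });
```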
| Comment by Greg Studer [ 24/Oct/11 ] |
|
As a workaround - you can set the maxConns parameter in mongos >= 1.8.3 to limit the maximum number of connections it will open per-host - the default behavior is to try to service requests no matter how fast they come in.

Also, if you increase the verbosity of mongos and mongod to logLevel : 2 - do you see the commands being (re)-sent?

This may be a systemic issue - where a small backup on the primary causes new connections and larger backups on the primary, and so on... |
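A sketch of how those two suggestions might look in practice, assuming the standard `--maxConns` startup option for mongos and the `setParameter` admin command for verbosity (the values here are examples, not recommendations from the ticket):

```javascript
// Raise log verbosity at runtime on mongos and on the shard primaries so
// the (re)sent commands become visible, then lower it again afterwards.
// (--maxConns itself is a startup option, e.g. `mongos --maxConns 1000`.)
db.adminCommand({ setParameter: 1, logLevel: 2 });
// ...capture logs while the problem reproduces...
db.adminCommand({ setParameter: 1, logLevel: 0 });
```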
| Comment by Y. Wayne Huang [ 21/Oct/11 ] |
|
shard2 command ops graph attached.

unfortunately, during this process, with mongos (v2.0.1-pre-, git: a7a36ecbf060624a1bfcf36e0a00de02b907c019) there is no log output. the affected shard primaries also do not log the command queries being performed; however, we definitely see them show up in the count using mongostat. |
| Comment by Y. Wayne Huang [ 21/Oct/11 ] |
|
upon further investigation, the log messages above may be unrelated. here's what we see: command ops/sec on some shard primaries will increase to 1500-2000 ops/sec, during which performance is degraded. bouncing mongos fixes the issue. asking an affected primary to step down stops the activity momentarily, but eventually the problem will also affect the new primary. bouncing mongos seems to be the only sure way to stop the issue for an extended period of time.

the issue above, we believe, is different. there's an rsSync process running which holds a global write lock. to the best of my knowledge, it's supposed to yield that lock as necessary. what we saw is that the process stopped yielding the lock, causing commands to pile up behind it. the connections from mongos, normally short-lived, now persist because the commands they are routing are blocked by the apparently non-yielding global write lock of the rsSync process.

the connection counts above are within range for our typical workload (~10-20 ops/sec). when viewing db.currentOp() during the pile-up, the only process holding a write lock was this rsSync process (output not preserved in this export). |
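As an aside, a rough shell sketch for spotting a long-lived operation such as the rsSync thread described above; it assumes the usual `db.currentOp()` output shape of that era (an `inprog` array with `opid`, `op`, `ns`, `secs_running`, and `desc` fields) and uses an arbitrary 5-second cutoff.

```javascript
// List in-progress operations that have been running longer than a few
// seconds, to spot a long-lived lock holder such as rsSync.
db.currentOp().inprog.forEach(function (op) {
    if (op.secs_running && op.secs_running > 5) {  // example cutoff
        printjson({
            opid: op.opid,
            desc: op.desc,
            op: op.op,
            ns: op.ns,
            secs_running: op.secs_running
        });
    }
});
```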