[SERVER-40762] Mongos does not accept new connections after mongos restart Created: 22/Apr/19 Updated: 27/Oct/23 Resolved: 25/Jun/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | razvan velcea | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | mgs13.log.gz, test_mongo.rb |
|
| Operating System: | ALL |
| Steps To Reproduce: |
|
| Participants: | razvan velcea, Danny Hatcher |
| Description |
|
After upgrading our MongoDB cluster from 3.4 to 3.6, we noticed a high load on the first shard. The load was caused by a high volume of updates against the config.system.sessions collection. To mitigate the high load on shard 1 we changed the FCV from 3.6 back to 3.4. A few days later a mongos instance was restarted (underlying hardware failure). We noticed that after the reboot, although the mongos service was up and running, it wasn't handling any traffic. The application generates these log lines only when it runs with debug mode on:
These lines appear in the mongos log if we increase the verbosity of the network component to 2:
|
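For reference, a minimal mongo shell sketch of the FCV change described above, run against a mongos (the version strings are the ones from this ticket):

```javascript
// Check the cluster's current featureCompatibilityVersion.
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })

// Downgrade the FCV from 3.6 back to 3.4, the mitigation described above,
// which stops the session-refresh updates against config.system.sessions.
db.adminCommand({ setFeatureCompatibilityVersion: "3.4" })
```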
| Comments |
| Comment by Danny Hatcher (Inactive) [ 25/Jun/19 ] |
|
I'm glad to hear that you resolved your issue. However, I'm not sure how "safe" it is to manually split config.system.sessions; it shouldn't cause any issues, but I do not believe it is something that we test for. If you experience issues later on, please try dropping that collection and letting the sharding system automatically create and shard it.
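A hedged sketch of that suggestion, assuming a mongo shell connected to a mongos with sufficient privileges:

```javascript
// Drop the sessions collection; the periodic logical-session cache refresh
// should recreate it and shard it automatically afterwards.
db.getSiblingDB("config").system.sessions.drop()
```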
| Comment by razvan velcea [ 24/Jun/19 ] |
|
Hi Daniel, Regarding your last question: I removed SSL/x509 auth and I no longer encountered the mongos restart issue on the test cluster. As I said in the description, this issue manifests only after downgrading from FCV 3.6 to FCV 3.4; the FCV downgrade was necessary because of the high load on shard 1 induced by updates to the config.system.sessions collection. We are now back on FCV 3.6. I managed to mitigate the shard 1 load induced by those updates: I manually split the config.system.sessions collection into several chunks (sketched below). Now, after the balancer distributed the chunks across all shards, the updates against config.system.sessions are evenly distributed across all shards.
This is no longer an issue for us. Thank you, Razvan
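A sketch of that manual pre-splitting; the split points below (quarters of the UUID space) are illustrative assumptions, not necessarily the ones used here:

```javascript
// config.system.sessions is sharded on { _id: 1 }, where _id.id is a UUID,
// so splitting the UUID range lets the balancer spread the session-refresh
// updates across shards instead of concentrating them on the first shard.
["4", "8", "c"].forEach(function (firstHexDigit) {
  // 1 leading digit + 31 zeros = a 32-character hex string for UUID().
  var splitPoint = UUID(firstHexDigit + "0000000000000000000000000000000");
  sh.splitAt("config.system.sessions", { _id: { id: splitPoint } });
});
```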
| Comment by Danny Hatcher (Inactive) [ 15/May/19 ] |
|
Hello Razvan, Unfortunately, as that version of the Ruby driver does not support MongoDB 3.6, I would not recommend using it on a permanent basis. However, if you are using it temporarily while you are on FCV 3.4 and your application doesn't experience any issues, it should be fine on a limited basis. I notice that in your original test and in the newest one you referenced you are utilizing x509 authentication. If you set up your test cluster to use SCRAM authentication instead, do you still experience the problem? If you disable SSL entirely on the test cluster, do you still experience it? Thanks, Danny
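For clarity, a sketch of the two authentication paths being compared (the user name and password are placeholders):

```javascript
// x509 authenticates against the $external database using the client
// certificate presented on the TLS connection.
db.getSiblingDB("$external").auth({ mechanism: "MONGODB-X509" })

// SCRAM authenticates with a username/password against a user database.
db.getSiblingDB("admin").auth("appUser", "appPassword")
```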
| Comment by razvan velcea [ 15/May/19 ] |
|
Hi Daniel,
I've checked which Ruby drivers are compatible with 3.6 and which are compatible with 3.4 only: https://docs.mongodb.com/ecosystem/drivers/driver-compatibility-reference/#ruby-driver-compatibility . I've redone the tests using the same Ruby script and mongo gem 2.4.3, which supports only MongoDB 3.4.
Thank you, Razvan
| Comment by razvan velcea [ 10/May/19 ] |
|
Hi Daniel,
I've done the steps with a completely new environment and got the same results.
Thank you, Razvan
| Comment by razvan velcea [ 09/May/19 ] |
|
Hi Daniel,
I will test using a completely empty environment and then come back to you.
Thank you, Razvan
| Comment by Danny Hatcher (Inactive) [ 08/May/19 ] |
|
Hello Razvan, I apologize; you are correct that I missed that final step. However, I have retested following the steps you mentioned and I am still unable to reproduce the issue using the Ruby script I attached. When you attempted the reproduction in your test environment, was it against a copy of your real data or against a brand-new sharded cluster with nothing in it? If you haven't tested using a completely empty environment, could you try to do so? Thanks, Danny
| Comment by razvan velcea [ 08/May/19 ] |
|
Hi Daniel,
Thank you for looking into it and for the tests you've made. I think I misled you by insisting too much on the second mongos restart required by the MongoDB 3.6 upgrade after setting the FCV to 3.6 (https://docs.mongodb.com/manual/release-notes/3.6-upgrade-sharded-cluster/index.html#restart-mongos-instances); I wanted to make sure that this step was not overlooked. Our issue does not surface during "Upgrade MongoDB Cluster to 3.6", which includes "Set the FCV to 3.6" and the second mongos restart. Our issue surfaces after we lower the FCV to 3.4. The steps should be:
As I said in the description, we upgraded to 3.6 (including FCV 3.6) and then noticed a high load on the first shard. The high load was induced by a large volume of updates against the config.system.sessions collection. To mitigate the load on the first shard we were forced to set the FCV from 3.6 back to 3.4. This step stopped the updates and lowered the load on the first shard. Then we had an unscheduled mongos restart and discovered this issue: a restarted mongos no longer accepts app connections (something related to the admin.system.keys collection, which is no longer present on the config servers; see the sketch below).
Thank you, Razvan
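A sketch of how one might inspect the keys collection mentioned above, assuming a shell connected to the config server replica set:

```javascript
// In 3.6, the config servers store the keys used to sign logical-session
// and cluster-time data in admin.system.keys; per this comment, the
// collection was no longer present after the FCV downgrade to 3.4.
db.getSiblingDB("admin").system.keys.find().pretty()
```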
| Comment by Danny Hatcher (Inactive) [ 07/May/19 ] |
|
Hello Razvan, Thank you for providing an example script; I was able to use it in my testing. I modified it slightly to catch MongoDB socket errors and reduced the number of connection options: test_mongo.rb
1. Start up a 3.4.19 sharded cluster with two mongos nodes.
However, I was unable to reach a state where one of the mongos nodes received no traffic. It did take some time after an occasional restart for the node to be included in requests, but it eventually was included again. Following the exact steps I listed above and the Ruby file attached, do you still run into the issue?
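One way (an assumption on my part, not one of the attached steps) to confirm whether a given mongos is receiving application traffic is to poll its connection counters from a shell connected to that mongos:

```javascript
// connections.current / connections.totalCreated come from serverStatus
// and should keep climbing on a mongos that is accepting new clients.
var conns = db.serverStatus().connections;
printjson({ current: conns.current, totalCreated: conns.totalCreated });
```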
| Comment by razvan velcea [ 30/Apr/19 ] |
|
Hi Daniel,
Additional info:
I've reproduced the issue on a test environment using a Ruby script (not Java):
On the same test environment I managed to make the restarted mongos accept connections from the test Ruby script only after I performed this action:
Thank you, Razvan
| Comment by razvan velcea [ 30/Apr/19 ] |
|
Hi Daniel,
This is not referring to a project that has purchased support. I've attached the mongos log for the day it was restarted: mgs13.log.gz. As mentioned in the ticket description, the issue surfaced after the restart, which happened around 2019-04-17 09:57. Before 2019-04-17 09:57 you will find several client connections/authentications initiated to this mongos; right after the restart you will find a few of these lines and then no more new connections/authentications until the end of the day. These lines:
We are able to connect to this mongos with the shell and execute queries against the cluster; apparently only the app servers have a problem with this mongos. We restarted the mongos again around 2019-04-17T12:26, and the issue was still there.
Thank you, Razvan
| Comment by Danny Hatcher (Inactive) [ 24/Apr/19 ] |
|
Hello, Is this referring to a project that has purchased support with us? If so, please let me know, but I will proceed assuming that's not the case. Can you please attach the full mongos logs to this ticket? Can you connect to the mongos and manually run queries? If you restart the mongos, does it function correctly? Thanks, Danny