[SERVER-40762] Mongos does not accept new connections after mongos restart Created: 22/Apr/19  Updated: 27/Oct/23  Resolved: 25/Jun/19

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: razvan velcea Assignee: Danny Hatcher (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mgs13.log.gz     File test_mongo.rb     File test_mongo_connections.rb    
Operating System: ALL
Steps To Reproduce:
  • upgrade the cluster from 3.4 to 3.6, set FCV to 3.6, then perform the second rolling restart of the mongos instances
  • set FCV to 3.4
  • restart one mongos
Participants:

 Description   

After we upgraded our MongoDB cluster from 3.4 to 3.6 we noticed high load on the first shard. The load was related to a high volume of updates made against the config.system.sessions collection. To mitigate the high load on shard 1 we changed the FCV from 3.6 back to 3.4.
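For reference, the FCV change is a single admin command; a minimal sketch with the Ruby driver (the host name is an example, not our real topology):

require 'mongo'

# Downgrade the cluster's feature compatibility version back to 3.4.
client = Mongo::Client.new(['mgs1:27017'], database: 'admin')
client.database.command(setFeatureCompatibilityVersion: '3.4')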

A few days later a mongos instance was restarted (underlying hardware failure). We noticed that after the reboot, although the mongos service was up and running, it wasn't handling any traffic.

The application generates these log lines only when it is set to debug mode:

Command failed with error 211 (KeyNotFound): 'Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1553758276, 1424) } with id: 6672952338308661763' on server mongos13:27017. The full response is { "ok" : 0.0, "errmsg" : "Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1553758276, 1424) } with id: 6672952338308661763", "code" : 211, "codeName" : "KeyNotFound" }
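For illustration, this is roughly how the failure surfaces in application code; a sketch assuming the Ruby driver, with example host and collection names:

require 'mongo'

client = Mongo::Client.new(['mongos13:27017'], database: 'test')
begin
  client[:coll].find(name: 'qwerty').first
rescue Mongo::Error::OperationFailure => e
  # e.message carries "Cache Reader No keys found for HMAC ..." (code 211)
  warn "cluster-time signing failure: #{e.message}"
end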

These lines appear in the mongos log if we increase the network component verbosity to 2:

2019-04-22T07:18:10.769+0000 D ASIO [monitoring keys for HMAC] startCommand: RemoteCommand 30519803 -- target:cfg02:27019 db:admin expDate:2019-04-22T07:18:40.769+0000 cmd:{ find: "system.keys", filter: { purpose: "HMAC", expiresAt: { $gt: Timestamp(0, 0) } }, sort: { expiresAt: 1 }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1555917490, 9), t: 13 } }, maxTimeMS: 30000
2019-04-22T07:18:10.769+0000 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Starting asynchronous command 30519803 on host cfg02:27019 
2019-04-22T07:18:10.770+0000 D ASIO [NetworkInterfaceASIO-ShardRegistry-0] Request 30519803 finished with response: { cursor: { id: 0, ns: "admin.system.keys", firstBatch: [] }, ok: 1.0, $replData: { term: 13, lastOpCommitted: { ts: Timestamp(1555917490, 9), t: 13 }, lastOpVisible: { ts: Timestamp(1555917490, 9), t: 13 }, configVersion: 9, replicaSetId: ObjectId('59b0fb4ec5311a5aa291a03e'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('000000000000000000000000') } } 
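The same find can be run by hand to confirm the key cache has nothing to read; a sketch assuming the Ruby driver (host taken from the log above; the direct-connection option is an assumption):

require 'mongo'

# Read admin.system.keys directly; an empty result (firstBatch: [] in the
# log above) means there are no HMAC signing keys to validate $clusterTime.
client = Mongo::Client.new(['cfg02:27019'], database: 'admin', connect: :direct)
keys = client['system.keys'].find(
  purpose: 'HMAC',
  expiresAt: { '$gt' => BSON::Timestamp.new(0, 0) }
).to_a
puts "unexpired HMAC keys: #{keys.size}"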

 



 Comments   
Comment by Danny Hatcher (Inactive) [ 25/Jun/19 ]

I'm glad to hear that you resolved your issue. However, I'm not sure how "safe" it is to manually split the config.system.sessions collection. It shouldn't cause any issues, but I do not believe it is something that we test for. If you experience issues later on, please try dropping that collection and letting the sharding system automatically create it and shard it.
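For reference, the reset I'm describing amounts to something like this (a sketch assuming the Ruby driver; the host is a placeholder):

require 'mongo'

client = Mongo::Client.new(['mgs1:27017'], database: 'config')
# The sharding machinery recreates and reshards system.sessions on its own.
client['system.sessions'].drop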

Comment by razvan velcea [ 24/Jun/19 ]

Hi Daniel,

Regarding your last question: I removed SSL/x509 auth and no longer encountered the mongos restart issue on the test cluster.

As I said in the description, this issue manifests only after downgrading from FCV 3.6 to FCV 3.4. The FCV downgrade was necessary because of the high load on shard 1 induced by updates on the config.system.sessions collection.

Now we are back on FCV 3.6. I managed to mitigate the update-induced high load on shard 1 by manually splitting the config.system.sessions collection into several chunks, roughly as sketched below. After the balancer distributed the chunks across all shards, the updates against config.system.sessions were evenly distributed across all shards as well.
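A sketch of the kind of pre-split I mean (assuming the Ruby driver; the host, boundary count and UUID boundaries are illustrative, not the exact values I used):

require 'mongo'

client = Mongo::Client.new(['mgs1:27017'], database: 'admin')

# Split config.system.sessions at evenly spaced UUID boundaries so the
# balancer can spread the resulting chunks across all shards.
(1..15).each do |i|
  bytes = [format('%02x', i * 16) + '0' * 30].pack('H*')  # 0x10..., 0x20..., ..., 0xf0...
  client.database.command(
    split:  'config.system.sessions',
    middle: { _id: { id: BSON::Binary.new(bytes, :uuid) } }
  )
end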

 

This is no longer an issue for us.

Thank you,

Razvan

 

 

Comment by Danny Hatcher (Inactive) [ 15/May/19 ]

Hello Razvan,

Unfortunately, as that version of the Ruby driver does not support MongoDB 3.6, I would not recommend using it on a permanent basis. However, if you are using it temporarily while you are on FCV 3.4 and your application doesn't experience any issues, it should be fine on a limited basis.

I notice that in your original test and in the newest one you referenced, you are using x509 authentication. If you set up your test cluster to use SCRAM authentication instead, do you still experience the problem? If you disable SSL entirely on the test cluster, do you still experience it?
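To make the comparison concrete, the three client configurations would look roughly like this (2.x Ruby driver option names; hosts, paths and credentials are placeholders):

require 'mongo'

hosts = ['mgs1:27017', 'mgs2:27017']

# Current setup: x509 over SSL
x509 = Mongo::Client.new(hosts, auth_mech: :mongodb_x509, ssl: true,
                         ssl_cert: '/etc/ssl/client.pem',
                         ssl_key:  '/etc/ssl/client.pem',
                         ssl_ca_cert: '/etc/ssl/ca.pem')

# Variant 1: SCRAM over SSL
scram = Mongo::Client.new(hosts, user: 'testuser', password: 'secret',
                          auth_mech: :scram, auth_source: 'admin',
                          ssl: true, ssl_ca_cert: '/etc/ssl/ca.pem')

# Variant 2: no SSL, no auth
plain = Mongo::Client.new(hosts, database: 'test')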

Thanks,

Danny

Comment by razvan velcea [ 15/May/19 ]

Hi Daniel,

 

I've checked which Ruby drivers are compatible with 3.6 and which are compatible with 3.4 only: https://docs.mongodb.com/ecosystem/drivers/driver-compatibility-reference/#ruby-driver-compatibility .
I've also looked here to see all available gem versions: https://rubygems.org/gems/mongo/versions .

I reran the tests using the same Ruby script and mongo gem 2.4.3, which supports only up to MongoDB 3.4.
The "No keys found for HMAC" issue no longer appeared.

 

Thank you,

Razvan

Comment by razvan velcea [ 10/May/19 ]

Hi Daniel,

 

I repeated the steps in a completely new environment and got the same results.

  • I deployed a 3.4.10 cluster (3 x mongos, 3 x config, 2 shards, 3 nodes per shard)
  • I upgraded the cluster to 3.6.10
  • I set FCV to 3.6, then performed a rolling mongos restart
  • I configured a test Ruby script to connect to two of the mongos instances, then launched it
  • I set FCV to 3.4
  • I restarted one mongos; the Ruby test app started showing these lines from time to time:

Cannot connect to the server
User  (mechanism: mongodb_x509) is not authorized to access $external (used mechanism: MONGODB-X509)
D, [2019-05-10T12:42:25.461436 #31506] DEBUG -- : MONGODB | mgs1:27017 | $external.authenticate | STARTED | {}
D, [2019-05-10T12:42:25.463183 #31506] DEBUG -- : MONGODB | mgs1:27017 | $external.authenticate | FAILED | Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1557491955, 2) } with id: 6689372749900546049 (211) | 0.001649321s
W, [2019-05-10T12:42:25.463266 #31506]  WARN -- : MONGODB | Failed to handshake with mgs1:27017: Mongo::Auth::Unauthorized: User  (mechanism: mongodb_x509) is not authorized to access $external (used mechanism: MONGODB-X509)

  • I restarted the Ruby test app; now it shows these lines from time to time:

Operation Failure
Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1557491955, 2) } with id: 6689372749900546049 (211)
D, [2019-05-10T12:50:57.597899 #33538] DEBUG -- : MONGODB | mgs1:27017 | test.find | STARTED | {"find"=>"coll", "filter"=>{"name"=>"qwerty"}, "projection"=>{"_id"=>0}, "$clusterTime"=>{"clusterTime"=>#<BSON::Timestamp:0x00000000011315e0 @seconds=1557491955, @increment=2>, "signature"=>{"hash"=><BSON::Binary:0x9013820 type=generic data=0x326e7da...
D, [2019-05-10T12:50:57.600023 #33538] DEBUG -- : MONGODB | mgs1:27017 | test.find | FAILED | Cache Reader No keys found for HMAC that is valid for time: { ts: Timestamp(1557491955, 2) } with id: 6689372749900546049 (211) | 0.001970795s
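(For context: the attached test_mongo_connections.rb is the authoritative script; its core is a polling loop along these lines, sketched here with example hosts and collection names:)

require 'mongo'

Mongo::Logger.logger.level = Logger::DEBUG  # produces the MONGODB | ... lines above

client = Mongo::Client.new(['mgs1:27017', 'mgs2:27017'], database: 'test')

loop do
  begin
    client[:coll].find({ name: 'qwerty' }, projection: { _id: 0 }).first
  rescue Mongo::Error::OperationFailure => e
    puts "Operation Failure\n#{e.message}"
  rescue Mongo::Error => e
    puts "Cannot connect to the server\n#{e.message}"
  end
  sleep 1
end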

 

Thank you,

Razvan

 

Comment by razvan velcea [ 09/May/19 ]

Hi Daniel,

 

I will test using a completely empty environment and then come back to you.

 

Thank you,

Razvan

Comment by Danny Hatcher (Inactive) [ 08/May/19 ]

Hello Razvan,

I apologize; you are correct that I missed that final step. However, I have retested following the steps you mentioned and I am still unable to reproduce using the Ruby script I attached. When you attempted the reproduction in your test environment, was it against a copy of your real data or against a brand-new sharded cluster with nothing in it? If you haven't tested using a completely empty environment, could you try to do so?

Thanks,

Danny

Comment by razvan velcea [ 08/May/19 ]

Hi Daniel,

 

Thank you for looking into it and for the tests you've made.

I think I misled you by insisting too much on the second mongos restart required by the MongoDB 3.6 upgrade after you set FCV to 3.6 (https://docs.mongodb.com/manual/release-notes/3.6-upgrade-sharded-cluster/index.html#restart-mongos-instances). I wanted to make sure that this step was not overlooked.

Our issue does not surface during "Upgrade MongoDB Cluster to 3.6", which includes "Set the FCV to 3.6" and the second mongos restart.

Our issue surfaces after we lower the FCV to 3.4.

The steps should be:

  1. Start up a 3.4.19 sharded cluster with two mongos nodes.
  2. Start the test script - set it to connect to both mongos nodes.
  3. Upgrade the nodes to 3.6.12 following the documented upgrade procedure.
  4. Set the FCV of the cluster to 3.6.
  5. Restart the two mongos nodes.
  6. Set the FCV of the cluster to 3.4.
  7. Restart one of the mongos nodes again.

As I said in the description, we upgraded to 3.6 (including FCV 3.6) and then noticed high load on the first shard. The high load was induced by a large volume of updates made against the config.system.sessions collection. To mitigate the high load on the first shard we were forced to set the FCV from 3.6 back to 3.4. This step stopped the flood of updates and lowered the load on the first shard. Then we had an unscheduled mongos restart and discovered this issue: a restarted mongos no longer accepts app connections (something related to the admin.system.keys collection, which is no longer present on the config servers).

 

Thank you,

Razvan

 

Comment by Danny Hatcher (Inactive) [ 07/May/19 ]

Hello Razvan,

Thank you for providing an example script; I was able to use it in my testing. I modified it slightly to catch MongoDB socket errors and reduced the number of connection options (test_mongo.rb). Then I followed the process you mentioned in your comment.

1. Start up a 3.4.19 sharded cluster with two mongos nodes.
2. Start the test script.
3. Upgrade the nodes to 3.6.12 following our documented upgrade procedure.
4. Set the FCV of the cluster to 3.6.
5. Restart the two mongos nodes.
6. Restart one of the mongos nodes again.

However, I was unable to reach a state where one of the mongos nodes received no traffic. It did take some time after an occasional restart for the node to be included in requests but it eventually was included again. Following the exact steps I listed above and the ruby file attached, do you still run into the issue?

Comment by razvan velcea [ 30/Apr/19 ]

Hi Daniel,

 

Additional info : 

  • Application servers are using MongoDB Java driver 3.7.1 and 3.8.2.
  • I noticed that the admin.system.keys collection is no longer present on the config servers after FCV was set back to 3.4.

 

I've reproduced the issue in a test environment using a Ruby script (not Java):

  • I upgraded a MongoDB cluster from 3.4 to 3.6 (with FCV set to 3.6, followed by the second rolling restart of the mongos instances)
  • I set FCV to 3.4
  • I created a Ruby script and set it to connect to two mongos instances (using the mongo 2.8.0 gem): test_mongo_connections.rb
  • I restarted one mongos

 

On the same test environment, I managed to make the restarted mongos accept connections from the test Ruby script only after I took this action:

  • I removed the 3.6 binaries and installed the 3.4 binaries.

 

Thank you,

Razvan

 

Comment by razvan velcea [ 30/Apr/19 ]

Hi Daniel,

 

This is not referring to a project that has purchased support.

I've attached the mongos log for the day it was restarted: mgs13.log.gz.

As mentioned in the ticket description, the issue surfaced after the restart, which happened around 2019-04-17 09:57.

Before 2019-04-17 09:57 you will find several client connections/authentications initiated to this mongos; right after the restart you will find a few of these lines and then no more new connections/authentications until the end of the day. These lines:

 

authenticate db: $external { authenticate: 1, user: "CN=Client,OU=Client,O=org,L=San Jose,ST=California,C=US", mechanism: "MONGODB-X509", $db: "$external" }

We are able to connect to this mongos with the shell and execute queries against the cluster; apparently only the app servers are having a problem with this mongos.

We restarted the mongos again around 2019-04-17T12:26; the issue was still there.

 

Thank you,

Razvan

 

 

Comment by Danny Hatcher (Inactive) [ 24/Apr/19 ]

Hello,

Is this referring to a project that has purchased support with us? If so, please let me know; otherwise I will proceed assuming that's not the case.

Can you please attach the full mongos logs to this ticket? Can you connect to the mongos and manually run queries? If you restart the mongos does it function correctly?

Thanks,

Danny
