[SERVER-11332] Authentication requests delayed if first config server is unresponsive Created: 23/Oct/13  Updated: 11/Jul/16  Resolved: 22/May/14

Status: Closed
Project: Core Server
Component/s: Performance, Sharding
Affects Version/s: 2.4.6
Fix Version/s: 2.6.2, 2.7.1

Type: Improvement Priority: Major - P3
Reporter: Alexander Komyagin Assignee: Greg Studer
Resolution: Done Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

sharded cluster, 3 config servers, auth


Attachments: Text File SERVER-11332 mongos verbose log.txt     Text File SERVER-11332 reproduce notes.txt     File sync_hung_cmd.js    
Issue Links:
Duplicate
is duplicated by SERVER-13323 listDBs block when first mongo config... Closed
is duplicated by SERVER-9916 be smarter about config server retrie... Closed
Backport Completed:
Participants:

 Description   
Issue Status as of May 14, 2014

ISSUE SUMMARY
For MongoDB sharded clusters with authentication enabled, authentication requests on new connections can query the first config server if authentication data is not already cached. If this config server is unresponsive, the request waits out a 30-second timeout before the next config server is contacted. These timeouts sometimes delay new connections, manifesting as slow queries or other operations. An internal mongos server parameter, internalSCCAllowFastestAuthConfigReads, was added to enable reading authentication data from the first config server to respond.

USER IMPACT
In authenticated environments, when the first config server becomes unresponsive (note: this is different from the config server shutting down as connections would then fail immediately) and authentication data is not cached, queries and other operations can be delayed by up to 30 seconds.
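
The difference between a stopped and a hung first config server can be modeled with a small Python sketch. It is illustrative only: the ConfigServer type, its states, and sequential_read_delay are hypothetical names rather than mongos code, and the 0.002-second latency is simply borrowed from the test output in the comments on this ticket.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class State(Enum):
    UP = "up"      # answers after `latency` seconds
    DOWN = "down"  # connection is refused immediately
    HUNG = "hung"  # accepts the request but never answers


@dataclass
class ConfigServer:
    state: State
    latency: float = 0.002  # assumed response time of a healthy server


def sequential_read_delay(servers: Sequence[ConfigServer],
                          timeout: float = 30.0) -> float:
    """Seconds until the first answer when servers are tried in order."""
    elapsed = 0.0
    for server in servers:
        if server.state is State.UP:
            return elapsed + server.latency
        if server.state is State.HUNG:
            # A hung server costs the full timeout before the next one is tried.
            elapsed += timeout
        # A DOWN server fails immediately and adds no delay.
    raise RuntimeError("no config server answered")
```

Under these assumptions, a DOWN first server adds no delay, while a HUNG first server adds the full 30-second timeout before the second server is tried.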

WORKAROUNDS
The preferred workaround is to block the first config server using a firewall (e.g. with iptables) so that connections to it fail immediately. In this case, the second config server is contacted without the 30-second delay. If this is not possible, the internal mongos parameter internalSCCAllowFastestAuthConfigReads can be used to work around the issue.

AFFECTED VERSIONS
All previous versions are affected by this issue.

FIX VERSION
The fix is included in the 2.6.2 production release.

RESOLUTION DETAILS
For authentication requests (and only for those), a parameter internalSCCAllowFastestAuthConfigReads was added to allow all three config servers to be queried concurrently. To ensure consistent reads of all other metadata, all other requests use the normal mechanism of contacting the first config server, with a 30-second timeout.

Original description

Normal collection operations do not touch the config servers, but other operations do.
Some examples:

  • authentication
  • splits/balancer
  • listDatabases
  • creating database
  • creating collection

Possible Solutions:

  • send reads to all (maybe with a tiny backoff), respond from first response (maybe with threshold) (preferred)
  • blacklist (a bit ugly + racy)
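
The preferred option (send the read to all hosts and answer from the first response) can be sketched in Python with a thread pool. This is a minimal illustration under stated assumptions, not the C++ change committed for this ticket: read_from_fastest and the reader callables are hypothetical names, and each reader stands in for a read against one config server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def read_from_fastest(readers: Sequence[Callable[[], T]], timeout: float) -> T:
    """Run every reader concurrently and return the first successful result."""
    if not readers:
        raise ValueError("at least one reader is required")
    pool = ThreadPoolExecutor(max_workers=len(readers))
    try:
        futures = [pool.submit(reader) for reader in readers]
        last_error: Optional[BaseException] = None
        # as_completed yields futures in completion order and raises a
        # TimeoutError if nothing further completes within `timeout` seconds.
        for future in as_completed(futures, timeout=timeout):
            try:
                return future.result()
            except Exception as exc:
                # This host failed; keep waiting for the others.
                last_error = exc
        raise RuntimeError("every host failed") from last_error
    finally:
        # Do not block on slow or hung readers still running in the pool.
        pool.shutdown(wait=False)
```

A caller would pass one reader per config server. Because the pool is shut down without waiting, a hung reader keeps running in the background until it returns, so a real implementation would also need its own per-host timeout or cancellation.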


 Comments   
Comment by Githook User [ 15/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 hookup of fastest query to SyncClusterConnection
(cherry picked from commit d2e4b7d17a8b4a406f053e39f692f394d66e6b11)
Branch: v2.6
https://github.com/mongodb/mongo/commit/945ed48cc77ecbc97f7fdc6f7a06c8968a7a14c5

Comment by Githook User [ 15/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 multi host query from fastest host using thread pools

(cherry picked from commit f8f57002f72e38d8595674937cd11df42b4ecba7)
(cherry picked from commit db7e5996c7da7d3383ae2c211171bb21ae2b7e00)
Branch: v2.6
https://github.com/mongodb/mongo/commit/fef6805061b31e1c6269a438c1922f17db72213b

Comment by Githook User [ 15/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 minor cleanup of SCC and chunk diff timeout
(cherry picked from commit ac43ecd3c540ad5c191dec27d9fb5a7b0ac4e8f9)
Branch: v2.6
https://github.com/mongodb/mongo/commit/261a158b8729bb97c545edbe66c6a53a8aa8c7f4

Comment by Githook User [ 15/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 ConnectionString less-than for simpler use in maps
(cherry picked from commit 0e3d4410933999e94a5937b08491824138c654d6)
Branch: v2.6
https://github.com/mongodb/mongo/commit/1e9944fbed01c900cb2d8c7e38b38a7acf9e657b

Comment by Githook User [ 14/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 hookup of fastest query to SyncClusterConnection
Branch: master
https://github.com/mongodb/mongo/commit/d2e4b7d17a8b4a406f053e39f692f394d66e6b11

Comment by Githook User [ 14/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 multi_host_query_test fix race between timeout and last result
Branch: master
https://github.com/mongodb/mongo/commit/db7e5996c7da7d3383ae2c211171bb21ae2b7e00

Comment by Githook User [ 13/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 multi host query from fastest host using thread pools
Branch: master
https://github.com/mongodb/mongo/commit/f8f57002f72e38d8595674937cd11df42b4ecba7

Comment by Githook User [ 12/May/14 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: Revert "SERVER-11332 multi host query from fastest host using thread pools"

This reverts commit 03f0d9c627136c6296de400467bbbbd73c9d7a72
Branch: master
https://github.com/mongodb/mongo/commit/ce04ab3728edeff71f0c32590558cb980a07fdb3

Comment by Githook User [ 12/May/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 multi host query from fastest host using thread pools
Branch: master
https://github.com/mongodb/mongo/commit/03f0d9c627136c6296de400467bbbbd73c9d7a72

Comment by Githook User [ 18/Apr/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 minor cleanup of SCC and chunk diff timeout
Branch: master
https://github.com/mongodb/mongo/commit/ac43ecd3c540ad5c191dec27d9fb5a7b0ac4e8f9

Comment by Githook User [ 18/Apr/14 ]

Author: Greg Studer (gregstuder) <greg@10gen.com>

Message: SERVER-11332 ConnectionString less-than for simpler use in maps
Branch: master
https://github.com/mongodb/mongo/commit/0e3d4410933999e94a5937b08491824138c654d6

Comment by Asya Kamsky [ 03/Jan/14 ]

jstest attached. Works on Mac and Linux.

Sample output with mongoX process logging suppressed:

{ "shardAdded" : "shard0000", "ok" : 1 }
adding admin user
{
	"user" : "admin",
	"pwd" : "1749c2646695f9a77e2c4fdda2e7f585",
	"roles" : [
		"userAdminAnyDatabase",
		"readWriteAnyDatabase",
		"clusterAdmin"
	],
	"_id" : ObjectId("52c5ff8a1b6a0dd66c9c9da9")
}
logging in as admin user
1
adding regular user
{
	"user" : "foo",
	"readOnly" : false,
	"pwd" : "3563025c1e89c7ad43fb63fcbcf1c3c6",
	"_id" : ObjectId("52c5ff8a1b6a0dd66c9c9daa")
}
1
 
 
----
All three configs are up!  Took 0.002 sec to log in
----
 
 
Thu Jan  2 19:08:44.143 shell: stopped mongo program on port 29000
First server is down
1
 
 
----
First config is completely down!  Took 0.002 sec to log in
----
 
 
Thu Jan  2 19:08:44.146 shell: started program mongod --port 29000 --dbpath /data/db/hungConfigServer-config0 --keyFile /Users/asya13/keyFile --configsvr --setParameter enableTestCommands=1 --setParameter enableTestCommands=1
First config is back
Running kill -TSTP `ps auxww | grep mongod | grep -v kill | grep -v grep | grep 29000 | awk '{print $2}'`
Thu Jan  2 19:08:44.349 shell: started program bash -c kill -TSTP `ps auxww | grep mongod | grep -v kill | grep -v grep | grep 29000 | awk '{print $2}'`
1
 
 
----
First config is hung/not answering!  Took 30.002 sec to log in
----
 
 
Thu Jan  2 19:09:14.432 shell: started program bash -c kill -CONT `ps auxww | grep mongod | grep -v kill | grep -v grep | grep 29000 | awk '{print $2}'`
Thu Jan  2 19:09:15.519 shell: stopped mongo program on port 30999
Thu Jan  2 19:09:16.520 shell: stopped mongo program on port 30000
Thu Jan  2 19:09:17.521 shell: stopped mongo program on port 29000
Thu Jan  2 19:09:18.522 shell: stopped mongo program on port 29001
Thu Jan  2 19:09:19.523 shell: stopped mongo program on port 29002
*** ShardingTest hungConfigServer completed successfully in 40.898 seconds ***

I didn't add any asserts, but a trivial fix would be to assert if any of the logins/auths takes more than 1 second (30 seconds is what it was taking on my Mac; it seems to be OS-dependent).
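
The assertion suggested above could look roughly like the following Python sketch. The attached jstest is JavaScript, so this only illustrates the shape of the check; assert_completes_within and the operation callable (standing in for a login call) are hypothetical names.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def assert_completes_within(operation: Callable[[], T],
                            limit_seconds: float = 1.0,
                            label: str = "login") -> T:
    """Run `operation` and fail if it takes longer than `limit_seconds`."""
    start = time.monotonic()
    result = operation()
    elapsed = time.monotonic() - start
    assert elapsed <= limit_seconds, (
        f"{label} took {elapsed:.3f} sec, limit is {limit_seconds} sec"
    )
    return result
```

Wrapping each login in such a helper would turn the 30-second case in the output above into a test failure rather than a line in the log.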

Comment by Asya Kamsky [ 01/Jan/14 ]

To clarify, this issue only happens when opening new connections with --auth on. Using connection pooling would limit the impact to operations that need to write to the config servers (new databases, etc.).

Comment by Henrik Ingo (Inactive) [ 26/Dec/13 ]

Attaching the steps to reproduce, along with the output from running the test and a snippet of the mongos verbose log.

Summary:

  • issue happens on a sharded cluster when using authentication
  • happens both for a user specified in the admin db and for one in the application db (test)
  • when not running with --auth/--keyFile, issue does not happen