[SERVER-11980] Improve user cache invalidation enforcement on mongos Created: 05/Dec/13 Updated: 19/May/15 Resolved: 27/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Security, Sharding |
| Affects Version/s: | 2.5.4 |
| Fix Version/s: | 2.6.10, 2.7.1 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Andreas Nilsson | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 1 |
| Labels: | 26qa | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Backport Completed: | |
| Participants: | |
| Description |
|
When user or role information is updated on a mongod/mongos, the in-memory role graph and user cache on that node are updated immediately in the standard case. With multiple mongos instances, however, each mongos checks the config servers for new user and role data only every 10 minutes. This means that a change such as a revoked user can take up to 10 minutes to propagate across the cluster. The interval can be lowered, at the risk of introducing network noise and repeated cache invalidation. This could be resolved by piggybacking the check on the ordinary ping that mongos sends to the config servers every 30 seconds. An additional improvement would be to not invalidate the cache in its entirety, but to update only the parts that have changed. |
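The two ideas in the description (piggybacking change information on the regular ping, and evicting only the changed entries) can be sketched roughly as follows. This is a minimal illustration, not the server's implementation: `ConfigServer`, `MongosUserCache`, the `generation` counter, and the shape of the ping reply are all hypothetical names assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ConfigServer:
    """Hypothetical stand-in for the user data held by the config servers."""
    users: dict = field(default_factory=dict)     # user name -> list of roles
    versions: dict = field(default_factory=dict)  # user name -> generation of last change
    generation: int = 0                           # bumped on every user change

    def set_user(self, name, roles):
        self.generation += 1
        self.users[name] = roles
        self.versions[name] = self.generation

    def drop_user(self, name):
        self.generation += 1
        self.users.pop(name, None)
        self.versions[name] = self.generation

    def ping(self, seen_generation):
        """Ordinary ping reply with the changed user names piggybacked on it."""
        changed = [n for n, v in self.versions.items() if v > seen_generation]
        return {"ok": 1, "generation": self.generation, "changed": changed}


@dataclass
class MongosUserCache:
    """Hypothetical per-mongos cache that evicts only the changed entries."""
    config: ConfigServer
    seen_generation: int = 0
    entries: dict = field(default_factory=dict)

    def get_user(self, name):
        # Lazily fetch a user on first use; a missing user is cached as None.
        if name not in self.entries:
            self.entries[name] = self.config.users.get(name)
        return self.entries[name]

    def on_ping(self):
        # Called on the regular ping: no separate invalidation round trip.
        reply = self.config.ping(self.seen_generation)
        for name in reply["changed"]:
            self.entries.pop(name, None)  # evict this user only
        self.seen_generation = reply["generation"]
        return reply["changed"]
```

In this sketch, revoking one user evicts only that user's entry on the next ping, so lookups for every other cached user keep being served from memory.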
| Comments |
| Comment by Githook User [ 01/May/15 ] | ||
|
Author: Gianfranco Palumbo (gianpaj) <gianpa@gmail.com> Message: (cherry picked from commit a51c7bcac0b93ea0a0da73974bac6c469075864d) Signed-off-by: Spencer T Brody <spencer@mongodb.com> | ||
| Comment by Spencer Brody (Inactive) [ 16/Mar/15 ] | ||
|
eric.sommer, I feel like that is more of a problem with mongos connection behavior than the user cache invalidation. One config server being down shouldn't cause read-only operations on config data to take 10 seconds to complete. I suspect the upcoming work to change config servers to be a replica set and the accompanying rewrite of the code used for config server communication will alleviate this, but we should reconfirm after the changes have been implemented. | ||
| Comment by Eric Sommer [ 16/Mar/15 ] | ||
|
Request for backport to v2.6: A customer is running a DR test on a sharded cluster in which they take down one of the config servers. They are seeing timeouts in their application every 10 minutes. We've traced this back to the UserCacheInvalidatorThread. In the mongos log (at logLevel 2), we see:
Then there is a 10s gap during which no queries are handled: the mongos is trying to contact the downed config server, and the gap ends only when the socket times out. Only after we see this line:
does mongos begin routing queries again. The application query (which executes on the mongod in a few ms once it gets there) has a >10s response time due to the delay on the mongos. The delay appears to be caused by the inability to authenticate new connections between the time the user cache is invalidated and the time it is updated. This is normally a short time, but it takes 10s when a config server is down. Updating only the parts of the user cache that have changed should prevent this delay in handling the queries. Customer env: mongod version 2.6.4, connecting with the Java driver | ||
| Comment by Githook User [ 14/May/14 ] | ||
|
Author: Spencer T Brody (stbrody) <spencer@mongodb.com> Message: |