[SERVER-8786] Race condition when setting ShardingConnectionHook on mongod connection pools Created: 28/Feb/13 Updated: 11/Jul/16 Resolved: 04/Mar/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Security, Sharding |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | 2.2.4, 2.4.0-rc2 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Operating System: | ALL | ||||
| Participants: | |||||
| Description |
|
We've seen a few cases where customers bring up a sharded cluster with authentication enabled and the shard primaries fail to query the config servers, with errors saying that they are unauthenticated. This leaves the system unusable. The mongods do not appear to even attempt to authenticate to the config servers, even though they successfully authenticate to the other nodes in their replica set. The problem seems to be that the ShardingConnectionHook, which also handles authentication for all connections used by sharding, isn't being set on the pool. Restarting the mongods seems to resolve the issue, which further supports my theory that this is a race condition. Investigation into the code brings us to the following function in d_state.cpp:
This is the code that sets the connection hook on the pools. It is not thread-safe: two connections could call addHook at the same time. Since addHook is basically just an append to a std::list, and standard library containers are not safe for concurrent modification, this could corrupt the linked list of connection hooks. This is my current theory as to how the ShardingConnectionHook can fail to be set. |
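As an illustration of the hazard described above, here is a minimal sketch of a hook list whose accesses are serialized by a mutex. The names `Hook`, `NoopHook`, `HookedPool`, and `registerConcurrently` are hypothetical stand-ins, not the server's actual classes, and this is not necessarily the fix that was committed for this ticket.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a connection hook; not the server's real class.
struct Hook {
    virtual ~Hook() = default;
    virtual void onCreate() = 0;
};

// Trivial hook used only to have something to register.
struct NoopHook : Hook {
    void onCreate() override {}
};

// Hypothetical pool that guards its hook list with a mutex, so concurrent
// addHook calls cannot modify the std::list at the same time.
class HookedPool {
public:
    void addHook(Hook* hook) {
        assert(hook != nullptr);
        std::lock_guard<std::mutex> lock(_mutex);
        _hooks.push_back(hook);
    }

    std::size_t hookCount() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _hooks.size();
    }

private:
    mutable std::mutex _mutex;
    std::list<Hook*> _hooks;
};

// Registers one hook per thread concurrently and returns how many were stored.
std::size_t registerConcurrently(std::size_t threadCount) {
    HookedPool pool;
    std::vector<NoopHook> hooks(threadCount);
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < threadCount; ++i) {
        threads.emplace_back([&pool, &hooks, i] { pool.addHook(&hooks[i]); });
    }
    for (auto& t : threads) {
        t.join();
    }
    return pool.hookCount();
}
```

With the lock in place, every registration is retained regardless of thread interleaving. Without it, two simultaneous `push_back` calls on the same `std::list` are a data race with undefined behavior.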
| Comments |
| Comment by auto [ 27/Mar/13 ] |

Author: Spencer T Brody <spencer@10gen.com> (2013-03-26T18:47:30Z) Message: |
| Comment by auto [ 04/Mar/13 ] |

Author: Spencer T Brody <spencer@10gen.com> (2013-03-04T16:44:24Z) Message: |
| Comment by auto [ 04/Mar/13 ] |

Author: Spencer T Brody <spencer@10gen.com> (2013-03-04T16:23:55Z) Message: |
| Comment by auto [ 01/Mar/13 ] |

Author: Spencer T Brody <spencer@10gen.com> (2013-02-28T22:57:51Z) Message: |
| Comment by auto [ 01/Mar/13 ] |

Author: Spencer T Brody <spencer@10gen.com> (2013-02-28T22:01:56Z) Message: |
| Comment by Spencer Brody (Inactive) [ 28/Feb/13 ] |
|
I have reproduced this issue locally by starting up a sharded cluster with authentication and, after connecting and authenticating, running a moveChunk as the very first operation. This breaks the donor shard, and after that all queries hitting that shard fail. The problem is that moveChunk calls shardingState.enable and configServer.init but doesn't set the ShardingConnectionHook. This prevents future setShardVersion calls from adding the ShardingConnectionHook, because they see sharding as already initialized. |
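One way to avoid the partial-initialization path described in this comment is to route every entry point through a single initialization routine that installs the hook in the same step that marks sharding as enabled. The sketch below is a simplified model under that assumption. `ShardingInit`, `ensureInitialized`, and `hookInstalled` are hypothetical names, and a boolean flag stands in for actually installing the ShardingConnectionHook.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Hypothetical model of per-process sharding state; names are illustrative.
class ShardingInit {
public:
    // Every entry point (for example the moveChunk and setShardVersion
    // handlers) would call this instead of enabling sharding piecemeal.
    // Returns true only for the call that actually performed initialization.
    bool ensureInitialized(const std::string& configServer) {
        assert(!configServer.empty());
        std::lock_guard<std::mutex> lock(_mutex);
        if (_initialized) {
            return false;  // already done, and the hook was installed with it
        }
        _configServer = configServer;
        _hookInstalled = true;  // stands in for installing the hook on the pools
        _initialized = true;    // set last, so "initialized" implies "hook set"
        return true;
    }

    bool hookInstalled() const {
        std::lock_guard<std::mutex> lock(_mutex);
        return _hookInstalled;
    }

private:
    mutable std::mutex _mutex;
    bool _initialized = false;
    bool _hookInstalled = false;
    std::string _configServer;
};
```

Because the hook and the initialized flag are set under one lock, no caller can observe sharding as initialized while the hook is missing, whichever command happens to run first.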