[SERVER-48178] Finding self in reconfig may be interrupted by closing connections due to rollback Created: 13/May/20 Updated: 29/Oct/23 Resolved: 08/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.1, 4.7.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Siyuan Zhou | Assignee: | Mark Benvenuto |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | safe-reconfig-related | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||
| Backport Requested: |
v4.4
|
||||||||||||||||||||||||||||
| Sprint: | Security 2020-06-29, Security 2020-07-13 | ||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||
| Linked BF Score: | 14 | ||||||||||||||||||||||||||||
| Description |
|
When a node installs a new config, it needs to create a new connection to find itself in the config. If the node start to roll back, the connection will be closed. Thus the node thinks it's not in this config and will transition to REMOVED. This issue becomes worse when authentication is enabled since more network roundtrips are involved. |
| Comments |
| Comment by Githook User [ 04/Aug/20 ] |
|
Author: {'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}Message: (cherry picked from commit acd6c3d067a90f6cf37b6802160b505f7935218b) |
| Comment by Pavithra Vetriselvan [ 13/Jul/20 ] |
|
Nope, it's not a blocker, just a bug fix! 4.4.1 should be fine. |
| Comment by Mark Benvenuto [ 13/Jul/20 ] |
|
pavithra.vetriselvan, do we need the fix for 4.4.0? |
| Comment by Pavithra Vetriselvan [ 13/Jul/20 ] |
|
mark.benvenuto This is happening on the 4.4 branch as well, so could we request a backport? I've linked the relevant BF. |
| Comment by Githook User [ 08/Jul/20 ] |
|
Author: {'name': 'Mark Benvenuto', 'email': 'mark.benvenuto@mongodb.com', 'username': 'markbenvenuto'}Message: |
| Comment by Siyuan Zhou [ 01/Jun/20 ] |
|
judah.schvimer, good point. Updated. |
| Comment by Judah Schvimer [ 01/Jun/20 ] |
|
siyuan.zhou, should this no longer be a replication quick win? |
| Comment by Siyuan Zhou [ 26/May/20 ] |
|
I looked more into the comment "We need to avoid the isMaster call triggered by a normal connect". IIUC, it's out of date. I didn't find any caller of isSelf function within the replication mutex. If it's called within the repl mutex, authentication should have triggered deadlock reliably. We have sufficient test coverage for that, so I'm not concerned. |
| Comment by Siyuan Zhou [ 26/May/20 ] |
|
Thanks for investigating into this! I feel I don't have the context of whether hardcoding SCRAM-SHA-256 is fine, but I trust Security team on that. Assigning to Security team. |
| Comment by Spencer Jackson [ 26/May/20 ] |
|
Looking at this some more, I think that the use cases which require performing intracluster mechanism negotiation use the connection_pool_tl. That suggests that it should be safe to remove the isMaster call from client/authenticate.cpp, and hardcode SCRAM-SHA-256. That would leave the connection tags alone, as the replication system expects. siyuan.zhou, would you want to assign this over to security to make that change? |
| Comment by Siyuan Zhou [ 21/May/20 ] |
|
Right, it should be hangUpOnStepDown: false. Sorry I missed that. All internal connections that go through TLConnection will attach the flag.
I think internal connections of replication system do want to avoid closing connection. I'm not sure of sharding connections. The flag is not set on connections from drivers or the shell - those external connections. We close connections in fewer cases on stepdown now since we prefer to keep the reads or kill the writes. However, rollback or removing the node still close all connections that's not kKeepOpen as you mentioned since a rollback node cannot even serve reads. The flag name is misleading. |
| Comment by Spencer Jackson [ 21/May/20 ] |
|
Looking into this some more, I think there might be a couple of scenarios where we still do need to rely on negotiation during intracluster authentication... so we probably still do need to call isMaster in the authentication path. siyuan.zhou, I realize that I may have said the wrong thing earlier, would the isMaster invocation need to pass hangUpOnStepDown: true or hangUpOnStepDown: false? Do you have any reason to believe that under some circumstances, that flag should not be set? |
| Comment by Siyuan Zhou [ 20/May/20 ] |
Oh, yes! Thank you. I assume isMaster is the first step in authentication, so we could pass 'hangUpOnStepDown: true'. Since we always skip closing connection in kPending state, I wonder why the isMaster invocation you linked above doesn't attach the flag by default. As a separate note, my understanding is we wanted to avoid a deadlock in isSelf, so we deliberately skip isMaster in
|
| Comment by Spencer Jackson [ 20/May/20 ] |
|
siyuan.zhou, are you sure that rollBack kills everything in the kKeepOpen state? Or does it kill everything not in the kKeepOpen state? I believe that this command path can invoke isMaster during authentication, in order to perform SCRAM mechanism negotiation. However, now that all nodes in the current and previous release support SCRAM-SHA-256 for intracluster auth, we could hardcode the mechanism and avoid the isMaster roundtrip. However, that might not be everything that you need, if 'hangUpOnStepDown: true' has be set on an isMaster invokation. Maybe authenticateInternalUser should stop calling isMaster, and you should start calling it in isSelf? |
| Comment by Siyuan Zhou [ 18/May/20 ] |
|
I worked around this problem by not making concurrent reconfig and rollback more likely in One solution is to avoid closing connection during authentication and isSelf command. Rollback closes connections in kKeepOpen state. kKeepOpen state is only set on isMaster. spencer.jackson, I'm not sure why the connection for isSelf can still be closed. I don't think we run isMaster for isSelf check at all. This issue only showed up when auth is enabled, which I thought is due to a wider race window. |