[SERVER-26970] isMaster can return isMaster: true while in drain mode Created: 09/Nov/16 Updated: 24/Mar/17 Resolved: 10/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Judah Schvimer | Assignee: | Siyuan Zhou |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Sprint: | Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||||||
| Description |
|
Earlier in this code block _isWaitingForDrainToComplete is set to false, and at that point isMaster will begin to return true. This is because isMaster does not wait for _canAcceptNonLocalWrites to be true to return true (which it probably should). This, however, occurs before we write the "new primary" noop to the server and log that "transition to primary complete; database writes are now permitted", which should happen before drain mode is complete (especially the "new primary" noop). This code block also unlocks and relocks its mutex without checking if the state has changed afterwards, which should be reconsidered. Related, but tangential is that _isWaitingForDrainToComplete and _isCatchingUp and _canAcceptNonLocalWrites are all related but currently function as 3 separate booleans. We should investigate if they can be combined into an enum. |
| Comments |
| Comment by Spencer Brody (Inactive) [ 10/Mar/17 ] |
|
The original issue this ticket was describing was fixed in https://github.com/mongodb/mongo/commit/1da3111dc238698e4e70672b7ba260a368121e50 as part of |
| Comment by Siyuan Zhou [ 10/Mar/17 ] |
|
_canAcceptNonLocalWrites is checked in isMaster as part of Regarding the lock and unlock of mutex, the state of replication cannot change because we are holding the global lock, so it's safe to re-acquire the mutex without checking the state. |
| Comment by Siyuan Zhou [ 08/Mar/17 ] |
|
We set _canAcceptNonLocalWrites before writing the "new primary" no-op, so the linked BF's run ReplSetTest.awaitReplication and see isMaste: true and the last op before the no-op. They don't expect the new no-op and failed. The HELP ticket is in a different scenario. The shard was blocked on stepping up because it cannot see the config servers. _canAcceptNonLocalWrites was set before that so a client thought the primary is ready to accept new writes. Discussed with kaloian.manassiev, I believe we cannot just move _canAcceptNonLocalWrites to after enabling sharding, because sharding needs to write to $admin database. Without seeing the config servers, no writes can be accepted anyway, so it's more about informative error messages. spencer, we could fix the BF's by setting _canAcceptNonLocalWrites after the no-op write and before enabling sharding, but that wouldn't fix the HELP ticket. |
| Comment by Spencer Brody (Inactive) [ 07/Mar/17 ] |
|
We've had users hit this issue and be pointed at this ticket to wait for the fix. Given that the behavior has been fixed I think we should ahead and close this ticket out. If you think there is something in the way we acquire and release mutexes during stepdown that needs to be looked at, please file a new ticket for that. I don't think it's guaranteed that |
| Comment by Siyuan Zhou [ 07/Mar/17 ] |
|
isMaster itself was fixed by |
| Comment by Spencer Brody (Inactive) [ 07/Mar/17 ] |
|
siyuan.zhou, was this fixed as part of |
| Comment by Siyuan Zhou [ 08/Dec/16 ] |
|
In The code in isMaster was audited in my patch (line 2059) and will be changed to check _canAcceptNonLocalWrites. The lock and unlock should still be reconsidered though. |
| Comment by Eric Milkie [ 09/Nov/16 ] |
|
We already had a ticket to make those 3 an enum earlier this year. The difficulty was that "_canAcceptNonLocalWrites" has special locking behavior that is not shared with the other bools. |