[SERVER-26970] isMaster can return isMaster: true while in drain mode Created: 09/Nov/16  Updated: 24/Mar/17  Resolved: 10/Mar/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Siyuan Zhou
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-27120 Increase synchronization between prod... Closed
Related
related to SERVER-28269 Set _canAcceptNonLocalWrites at the e... Closed
Operating System: ALL
Sprint: Repl 2017-01-23, Repl 2017-02-13, Repl 2017-03-06, Repl 2017-03-27
Participants:
Linked BF Score: 0

 Description   

Earlier in this code block _isWaitingForDrainToComplete is set to false, and at that point isMaster will begin to return true. This is because isMaster does not wait for _canAcceptNonLocalWrites to be true to return true (which it probably should). This, however, occurs before we write the "new primary" noop to the server and log that "transition to primary complete; database writes are now permitted", which should happen before drain mode is complete (especially the "new primary" noop).

This code block also unlocks and relocks its mutex without checking if the state has changed afterwards, which should be reconsidered.

Related, but tangential is that _isWaitingForDrainToComplete and _isCatchingUp and _canAcceptNonLocalWrites are all related but currently function as 3 separate booleans. We should investigate if they can be combined into an enum.



 Comments   
Comment by Spencer Brody (Inactive) [ 10/Mar/17 ]

The original issue this ticket was describing was fixed in https://github.com/mongodb/mongo/commit/1da3111dc238698e4e70672b7ba260a368121e50 as part of SERVER-27120. Any additional fixes to the process of transitioning to primary should happen in different tickets.

Comment by Siyuan Zhou [ 10/Mar/17 ]

_canAcceptNonLocalWrites is checked in isMaster as part of SERVER-27120. SERVER-28269 is filed for the proposed fix and this one will be resolved as a dup to SERVER-28269 when it is finished.

Regarding the lock and unlock of mutex, the state of replication cannot change because we are holding the global lock, so it's safe to re-acquire the mutex without checking the state.

Comment by Siyuan Zhou [ 08/Mar/17 ]

We set _canAcceptNonLocalWrites before writing the "new primary" no-op, so the linked BF's run ReplSetTest.awaitReplication and see isMaste: true and the last op before the no-op. They don't expect the new no-op and failed.

The HELP ticket is in a different scenario. The shard was blocked on stepping up because it cannot see the config servers. _canAcceptNonLocalWrites was set before that so a client thought the primary is ready to accept new writes. Discussed with kaloian.manassiev, I believe we cannot just move _canAcceptNonLocalWrites to after enabling sharding, because sharding needs to write to $admin database. Without seeing the config servers, no writes can be accepted anyway, so it's more about informative error messages.

spencer, we could fix the BF's by setting _canAcceptNonLocalWrites after the no-op write and before enabling sharding, but that wouldn't fix the HELP ticket.

Comment by Spencer Brody (Inactive) [ 07/Mar/17 ]

We've had users hit this issue and be pointed at this ticket to wait for the fix. Given that the behavior has been fixed I think we should ahead and close this ticket out. If you think there is something in the way we acquire and release mutexes during stepdown that needs to be looked at, please file a new ticket for that. I don't think it's guaranteed that SERVER-27892 will wind up addressing that.

Comment by Siyuan Zhou [ 07/Mar/17 ]

isMaster itself was fixed by SERVER-27120, but the lock and unlock of mutex still looks suspicious. Is it covered by SERVER-27892? If not, we can keep this ticket open to track that.

Comment by Spencer Brody (Inactive) [ 07/Mar/17 ]

siyuan.zhou, was this fixed as part of SERVER-27120?

Comment by Siyuan Zhou [ 08/Dec/16 ]

In SERVER-27120, I tried to separate the states of bgsync producer and applier from the replication coordinator. After that work, I'll remove _isWaitingForDrainToComplete and _isCatchingUp to favor bgsync producer and applier's own states. Drain mode and catchup mode actually don't affect replication coordinator's behavior. They only affect bgsync and applier's behavior.

The code in isMaster was audited in my patch (line 2059) and will be changed to check _canAcceptNonLocalWrites.

The lock and unlock should still be reconsidered though.

Comment by Eric Milkie [ 09/Nov/16 ]

We already had a ticket to make those 3 an enum earlier this year. The difficulty was that "_canAcceptNonLocalWrites" has special locking behavior that is not shared with the other bools.

Generated at Thu Feb 08 04:13:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.