[SERVER-42843] shutting down a node while in PRIMARY state will flap writability flag Created: 16/Aug/19  Updated: 27/Oct/23  Resolved: 19/Aug/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Eric Milkie Assignee: Backlog - Replication Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-42864 change index build initial write time... Closed
Assigned Teams:
Replication
Participants:

 Description   

When a node is shut down, either via a signal like SIGINT or via the shutdown command, the shutdown proceedings will attempt a stepdown. This stepdown attempt will toggle the canAcceptNonLocalWrites flag without actually changing the state of the node, and it will also do this while acquiring and releasing the RSTL. This breaks some assumptions for code that is using the canAcceptNonLocalWrites flag as an indication of PRIMARY versus SECONDARY state when performing unreplicated writes such as index builds.

You don't even need a successful shutdown for this situation to occur. Here is a log from a one-node replica set after I ran the shutdown command from the shell; it fails because I didn't use the force flag:

2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] setCanAcceptNonLocalWrites 1
2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] RSTL unlock
2019-08-16T08:51:59.365-0400 I SHARDING [monitoring-keys-for-HMAC] Marking collection admin.system.keys as collection version: <unsharded>
2019-08-16T08:52:09.379-0400 I NETWORK [listener] connection accepted from 127.0.0.1:32836 #1 (1 connection now open)
2019-08-16T08:52:09.380-0400 I NETWORK [conn1] received client metadata from 127.0.0.1:32836 conn1: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "0.0.0" }, os: { type: "Linux", name: "Fedora release 30 (Thirty)", architecture: "x86_64", version: "Kernel 5.1.20-300.fc30.x86_64" } }
2019-08-16T08:52:11.505-0400 I REPL [conn1] RSTL lock
2019-08-16T08:52:11.505-0400 I REPL [RstlKillOpThread] Starting to kill user operations
2019-08-16T08:52:11.505-0400 I REPL [RstlKillOpThread] Stopped killing user operations
2019-08-16T08:52:11.505-0400 I REPL [conn1] setCanAcceptNonLocalWrites 0
2019-08-16T08:52:11.505-0400 I REPL [conn1] RSTL unlock
2019-08-16T08:52:21.516-0400 I REPL [conn1] RSTL lock
2019-08-16T08:52:21.516-0400 I REPL [RstlKillOpThread] Starting to kill user operations
2019-08-16T08:52:21.516-0400 I REPL [RstlKillOpThread] Stopped killing user operations
2019-08-16T08:52:21.516-0400 I REPL [conn1] setCanAcceptNonLocalWrites 1
2019-08-16T08:52:21.516-0400 I REPL [conn1] RSTL unlock
2019-08-16T08:52:21.517-0400 I COMMAND [conn1] command admin.$cmd appName: "MongoDB Shell" command: shutdown { shutdown: 1.0, lsid: { id: UUID("b19857ea-57bf-410a-8d97-db96b1c2bfef") }, $clusterTime: { clusterTime: Timestamp(1565959929, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $db: "admin" } numYields:0 ok:0 errMsg:"No electable secondaries caught up as of 2019-08-16T08:52:21.516-0400. Please use the replSetStepDown command with the argument {force: true} to force node to step down." errName:ExceededTimeLimit errCode:262 reslen:387 locks:{ ReplicationStateTransition: { acquireCount: { W: 2 } } } protocol:op_msg 10011ms

The issue is that between setting the canAcceptWrites flag to 0 and setting it to 1, the RSTL is unlocked and locked again, thus allowing other operations the ability to lock the database and observe a state where the node is PRIMARY and yet canAcceptWrites is false.



 Comments   
Comment by Siyuan Zhou [ 20/Aug/19 ]

Thanks Eric. Discussed offline. Index build is special because it can do unreplicated writes sometimes, which needs to know the state of replication to assign a right timestamp. Simultaneous index build will solve this issue by making all writes replicated.

Comment by Eric Milkie [ 19/Aug/19 ]

What assumptions are you referring to here?

The exact assumption I mentioned in the text, "using the canAcceptNonLocalWrites flag as an indication of PRIMARY versus SECONDARY state". This ticket is describing a situation where my code can lock the RSTL, check canAcceptNonLocalWrites, see that it is false, and yet have the node be in state PRIMARY (and continue to be in state PRIMARY indefinitely afterwards).

Comment by Siyuan Zhou [ 19/Aug/19 ]

This stepdown attempt will toggle the canAcceptNonLocalWrites flag without actually changing the state of the node, and it will also do this while acquiring and releasing the RSTL.

I believe shutdown will change its state to Secondary if the conditional stepdown succeeds.

This breaks some assumptions for code that is using the canAcceptNonLocalWrites flag as an indication of PRIMARY versus SECONDARY state when performing unreplicated writes such as index builds.

The contract is that holding RSTL guarantees that canAcceptNonLocalWrites doesn't change. One RSTL is released, the node is free to change its state. What assumptions are you referring to here?

Comment by Eric Milkie [ 19/Aug/19 ]

If this is something we'd like to simplify in the future, is there a ticket to indicate that? I was expecting to be able to use this ticket for that work but it's now closed.

Comment by Eric Milkie [ 16/Aug/19 ]

I filed a separate ticket (linked) to fix the BF here; you can either put this ticket in your backlog to do something about eventually or just resolve it for now.

Comment by Judah Schvimer [ 16/Aug/19 ]

Assigning back to Eric after a discussion about a different solution. While this behavior is not the easiest contract, and something we'd like to simplify in the future, it is not necessarily a bug in the canAcceptNonLocalWrites flag.

Comment by Eric Milkie [ 16/Aug/19 ]

The indexing code is doing it to determine what timestamp to use when doing the initial catalog write to start an index build. It's either the timestamp associated with the oplog entry if on a secondary, or a timestamp generated by attempting to write a no-op oplog entry, or no timestamp if we can't do that because we're in initial sync or startup recovery.

Comment by Judah Schvimer [ 16/Aug/19 ]

What code is checking PRIMARY vs. SECONDARY to make a decision? I'm suspicious of that code as well.

Generated at Thu Feb 08 05:01:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.