Type: Improvement
Resolution: Works as Designed
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: Replication
Labels: None
Assigned Teams: Replication
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

When a node is shut down, either by a signal such as SIGINT or by the shutdown command, the shutdown procedure attempts a stepdown. That attempt toggles the canAcceptNonLocalWrites flag without actually changing the node's member state, and it releases and reacquires the RSTL (ReplicationStateTransition lock) between turning the flag off and turning it back on. This breaks an assumption made by code that uses the canAcceptNonLocalWrites flag as an indication of PRIMARY versus SECONDARY state when performing unreplicated writes such as index builds.
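To make the sequence concrete, here is a toy Python model of a failed stepdown attempt. It is a sketch under stated assumptions, not MongoDB's actual C++ implementation: the names `ReplNodeModel` and `attempt_stepdown_without_catchup` are hypothetical, a `threading.Lock` stands in for the RSTL, and only the path where no secondary catches up is modeled. The recorded events follow the order of the corresponding messages in the log in this issue.

```python
import threading
from typing import List


class ReplNodeModel:
    """Hypothetical toy model of a replica set node; not MongoDB's real classes."""

    def __init__(self) -> None:
        self.rstl = threading.Lock()          # stand-in for the RSTL
        self.member_state = "PRIMARY"         # a failed attempt never changes this
        self.can_accept_non_local_writes = True
        self.events: List[str] = []

    def _set_flag(self, value: bool) -> None:
        self.can_accept_non_local_writes = value
        self.events.append(f"setCanAcceptNonLocalWrites {int(value)}")

    def attempt_stepdown_without_catchup(self) -> None:
        # First RSTL critical section: stop accepting non-local writes.
        with self.rstl:
            self.events.append("RSTL lock")
            self._set_flag(False)
            self.events.append("RSTL unlock")
        # The RSTL is free here while the node waits for a secondary to
        # catch up; in this model no secondary ever does.
        # Second RSTL critical section: undo the flag change; still PRIMARY.
        with self.rstl:
            self.events.append("RSTL lock")
            self._set_flag(True)
            self.events.append("RSTL unlock")
```

After one call, `events` holds two separate lock/unlock pairs with the flag set to 0 in the first and back to 1 in the second, while `member_state` stays `"PRIMARY"` throughout.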

You don't even need a successful shutdown for this situation to occur. Here is a log from a one-node replica set after I ran the shutdown command from the shell; the shutdown fails because I didn't use the force flag:

2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] setCanAcceptNonLocalWrites 1
2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] transition to primary complete; database writes are now permitted
2019-08-16T08:51:59.365-0400 I REPL [rsSync-0] RSTL unlock
2019-08-16T08:51:59.365-0400 I SHARDING [monitoring-keys-for-HMAC] Marking collection admin.system.keys as collection version: <unsharded>
2019-08-16T08:52:09.379-0400 I NETWORK [listener] connection accepted from 127.0.0.1:32836 #1 (1 connection now open)
2019-08-16T08:52:09.380-0400 I NETWORK [conn1] received client metadata from 127.0.0.1:32836 conn1: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "0.0.0" }, os: { type: "Linux", name: "Fedora release 30 (Thirty)", architecture: "x86_64", version: "Kernel 5.1.20-300.fc30.x86_64" } }
2019-08-16T08:52:11.505-0400 I REPL [conn1] RSTL lock
2019-08-16T08:52:11.505-0400 I REPL [RstlKillOpThread] Starting to kill user operations
2019-08-16T08:52:11.505-0400 I REPL [RstlKillOpThread] Stopped killing user operations
2019-08-16T08:52:11.505-0400 I REPL [conn1] setCanAcceptNonLocalWrites 0
2019-08-16T08:52:11.505-0400 I REPL [conn1] RSTL unlock
2019-08-16T08:52:21.516-0400 I REPL [conn1] RSTL lock
2019-08-16T08:52:21.516-0400 I REPL [RstlKillOpThread] Starting to kill user operations
2019-08-16T08:52:21.516-0400 I REPL [RstlKillOpThread] Stopped killing user operations
2019-08-16T08:52:21.516-0400 I REPL [conn1] setCanAcceptNonLocalWrites 1
2019-08-16T08:52:21.516-0400 I REPL [conn1] RSTL unlock
2019-08-16T08:52:21.517-0400 I COMMAND [conn1] command admin.$cmd appName: "MongoDB Shell" command: shutdown { shutdown: 1.0, lsid: { id: UUID("b19857ea-57bf-410a-8d97-db96b1c2bfef") }, $clusterTime: { clusterTime: Timestamp(1565959929, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, $db: "admin" } numYields:0 ok:0 errMsg:"No electable secondaries caught up as of 2019-08-16T08:52:21.516-0400. Please use the replSetStepDown command with the argument {force: true} to force node to step down." errName:ExceededTimeLimit errCode:262 reslen:387 locks:{ ReplicationStateTransition: { acquireCount: { W: 2 } } } protocol:op_msg 10011ms
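The problematic window in a log like this one can be located mechanically. The following sketch assumes the plain-text line layout shown here (`timestamp severity component [thread] message`) and the `RSTL lock`, `RSTL unlock`, and `setCanAcceptNonLocalWrites N` messages exactly as they appear in this log; the function name `find_unlocked_gaps` is hypothetical. It reports each span that starts when the RSTL is released while the flag is 0 and ends at the next RSTL acquisition.

```python
from typing import Iterable, List, Optional, Tuple


def find_unlocked_gaps(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (start, end) timestamps of spans in which canAcceptNonLocalWrites
    is 0 and the RSTL has been released (hypothetical helper)."""
    gaps: List[Tuple[str, str]] = []
    flag: Optional[int] = None          # last logged flag value, if any
    gap_start: Optional[str] = None     # timestamp of the releasing unlock
    for line in lines:
        # Split into timestamp, severity, component, [thread], message.
        parts = line.split(None, 4)
        if len(parts) < 5:
            continue
        ts, message = parts[0], parts[4].strip()
        if message.startswith("setCanAcceptNonLocalWrites "):
            flag = int(message.split()[1])
        elif message == "RSTL unlock" and flag == 0:
            gap_start = ts              # lock released while writes are off
        elif message == "RSTL lock" and gap_start is not None:
            gaps.append((gap_start, ts))
            gap_start = None
    return gaps
```

Applied to the log above, it reports a single span from 08:52:11.505 to 08:52:21.516, the roughly ten seconds during which the node is still PRIMARY but canAcceptNonLocalWrites is 0 and the RSTL is free.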

The issue is that between setting the canAcceptNonLocalWrites flag to 0 and setting it back to 1, the RSTL is released and later reacquired. This allows other operations to take their locks in the meantime and observe a state in which the node is PRIMARY and yet canAcceptNonLocalWrites is false.
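The race can be reproduced with a second toy Python model. Everything here is hypothetical (`NodeModel`, `failed_stepdown`, `observe`, `run_race`) and is not MongoDB code: a `threading.Lock` again stands in for the RSTL, and two `threading.Event` objects force the interleaving deterministically, so that another operation runs exactly while the RSTL is released between the two flag changes.

```python
import threading
from typing import Tuple


class NodeModel:
    """Hypothetical toy node; `rstl` stands in for the RSTL."""

    def __init__(self) -> None:
        self.rstl = threading.Lock()
        self.member_state = "PRIMARY"
        self.can_accept_non_local_writes = True


def failed_stepdown(node: NodeModel, gap_open: threading.Event,
                    gap_done: threading.Event) -> None:
    with node.rstl:
        node.can_accept_non_local_writes = False
    gap_open.set()   # RSTL is released; let another operation run
    gap_done.wait()  # stands in for waiting for a secondary to catch up
    with node.rstl:
        node.can_accept_non_local_writes = True  # roll back; still PRIMARY


def observe(node: NodeModel) -> Tuple[str, bool]:
    # An unrelated operation takes the RSTL and reads both fields under it.
    with node.rstl:
        return node.member_state, node.can_accept_non_local_writes


def run_race() -> Tuple[Tuple[str, bool], NodeModel]:
    node = NodeModel()
    gap_open, gap_done = threading.Event(), threading.Event()
    stepdown = threading.Thread(target=failed_stepdown,
                                args=(node, gap_open, gap_done))
    stepdown.start()
    gap_open.wait()      # wait until the stepdown thread has released the RSTL
    seen = observe(node)
    gap_done.set()       # let the stepdown attempt finish and restore the flag
    stepdown.join()
    return seen, node
```

The observer sees `("PRIMARY", False)` even though it holds the lock while reading, so any caller that treats the flag as a proxy for PRIMARY would conclude the node is a SECONDARY at that moment. Once the attempt finishes, the flag is back to `True` and the member state never changed.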

Related to: SERVER-42864 change index build initial write timestamp logic (Closed)

Assignee: [DO NOT USE] Backlog - Replication Team
Reporter: Eric Milkie
Participants: [DO NOT USE] Backlog - Replication Team, Eric Milkie, Judah Schvimer, Siyuan Zhou
Votes: 0
Watchers: 7

Created: Aug 16 2019 12:57:58 PM UTC
Updated: Oct 27 2023 01:53:07 PM UTC
Resolved: Aug 19 2019 05:14:02 PM UTC