[SERVER-56756] Primary cannot stepDown when experiencing disk failures Created: 07/May/21  Updated: 29/Oct/23  Resolved: 24/Jan/22

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.0.23
Fix Version/s: 5.3.0

Type: Bug Priority: Major - P3
Reporter: Amirsaman Memaripour Assignee: Adi Zaimi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Documented
is documented by DOCS-15060 Investigate changes in SERVER-56756: ... Closed
Related
related to SERVER-65766 ShardingStateRecovery makes remote ca... Closed
related to SERVER-65825 Increase fassertOnLockTimeoutForStepU... Closed
related to SERVER-61251 Ensure long running storage engine op... Backlog
is related to SERVER-71520 Dump all thread stacks on RSTL acquis... Closed
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:
  • Start a 3-node replica-set.
  • Start a single node replica-set for config server.
  • Start a mongos server.
  • Start a workload with multiple client threads (e.g., 50) running a mixture of find/update operations against the primary (through mongos).
  • Freeze the dbpath for the primary (e.g., fsfreeze --freeze /mnt/primary).
  • Ask the primary to step down (a rough shell sketch of these steps follows).
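A rough shell sketch of the steps above (hostnames, ports, and dbpaths are illustrative placeholders, not taken from the ticket; replica-set initiation and addShard setup are omitted for brevity):

    # 3-node shard replica set (repeat for ports 27018/27019 with their own dbpaths)
    mongod --shardsvr --replSet rs0 --port 27017 --dbpath /mnt/primary --fork --logpath /mnt/primary/mongod.log
    # single-node config server replica set
    mongod --configsvr --replSet cfg --port 27020 --dbpath /data/cfg --fork --logpath /data/cfg/mongod.log
    # mongos in front of the cluster
    mongos --configdb cfg/localhost:27020 --port 27030 --fork --logpath /data/mongos.log
    # drive ~50 client threads of mixed find/update traffic through mongos (workload tool not shown)
    # freeze the filesystem backing the primary's dbpath
    fsfreeze --freeze /mnt/primary
    # ask the primary to step down
    mongo --port 27017 --eval 'db.adminCommand({replSetStepDown: 60})'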
Sprint: Repl 2021-07-12, Repl 2021-07-26, Repl 2021-08-09, Repl 2021-08-23, Replication 2021-11-15, Replication 2021-11-29, Replication 2021-12-13, Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
Participants:

 Description   

Sending a step-down request to a primary that is experiencing disk failures consistently fails with a time-out error:

{
        "operationTime" : Timestamp(1620337238, 857),
        "ok" : 0,
        "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
        "code" : 262,
        "codeName" : "ExceededTimeLimit",
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000001")
        },
        "lastCommittedOpTime" : Timestamp(1620337238, 327),
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1620337306, 1),
                        "t" : NumberLong(1)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1620337306, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

The error is returned from here, and the behavior is easy to reproduce; I've tested it on v4.0.23.

Also, I tried attaching GDB to the primary to collect stack traces, but GDB hangs and I haven't found an alternative yet.



 Comments   
Comment by Adi Zaimi [ 24/Jan/22 ]

Summary of the fix (more detailed in the PR):
We added a parameter `fassertOnLockTimeoutForStepUpDown`, which defaults to 15s (and can be turned off by setting it to 0). A server that has received a request to step down (or step up) and is unable to acquire the lock (more precisely, the RSTL) within that time will abort. This enables a cluster whose primary node is experiencing a disk failure (or other similarly rare issues) to elect a new primary and thus remain available.
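For reference, a minimal sketch of adjusting the new threshold, assuming it behaves like an ordinary server parameter and takes its value in seconds as described above (port and value are placeholders):

    # set at startup (alongside the usual replica set options)
    mongod --setParameter fassertOnLockTimeoutForStepUpDown=30 --replSet rs0 --port 27017 --dbpath /data/rs0-0
    # or at runtime; 0 disables the fassert behavior
    mongosh --port 27017 --eval 'db.adminCommand({setParameter: 1, fassertOnLockTimeoutForStepUpDown: 30})'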

Comment by Githook User [ 24/Jan/22 ]

Author:

{'name': 'Adi Zaimi', 'email': 'adizaimi@yahoo.com', 'username': 'adizaimi'}

Message: SERVER-56756 fassert in stepup/down when RSTL timeout is above threshold
Branch: master
https://github.com/mongodb/mongo/commit/757f44e1a1cc496f8030e00249aaf43163a85677

Comment by Adi Zaimi [ 05/Jan/22 ]

Continuing discussion here instead of on the github ticket.
Lingzhi suggests that {shutdown:1, force=true} is a workaround to the issue here, but unfortunately it does not seem to be the case. Sending shutdown causes RSTL to timeout:

 

{"t":\{"$date":"2022-01-05T17:55:32.989+00:00"} ,"s":"W", "c":"COMMAND", "id":4719000, "ctx":"conn104","msg":"Error stepping down during force shutdown","attr":{"error":{"code":24,"codeName":"LockTimeout","errmsg":"Unable to acquire X lock on '{4611686018427387905: ReplicationStateTransition, 1}' within 15000ms. opId: 3113, op: conn104, connId: 104."}}}
{"t":\{"$date":"2022-01-05T17:55:32.989+00:00"},"s":"I", "c":"COMMAND", "id":4695400, "ctx":"conn104","msg":"Terminating via shutdown command","attr":{"force":true,"timeoutSecs":15}}

...
However, this node never steps down and keeps sending heartbeats, so the other node does not step up until the FS is unfrozen:
===- Unfreeze the mount point @  Wed Jan 5 17:56:52 UTC 2022

{"t":\{"$date":"2022-01-05T17:57:03.201+00:00"}

,"s":"I", "c":"REPL", "id":6015309, "ctx":"OplogApplier-0","msg":"Logging transition to primary to oplog on stepup"}

Comment by Adi Zaimi [ 22/Dec/21 ]

Discussed with Lingzhi; creating a patch to abort() the server when it fails to acquire the RSTL on stepDown. Opening a PR to discuss this behavior further if needed.

Comment by Adi Zaimi [ 16/Dec/21 ]

I'm not sure what the codepath of the interruption you point out is, but in this case (fsfreeze) the process (well, the writer threads really) is in uninterruptible sleep – kill -9 doesn't even do anything until the fs is unfrozen.
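One way to confirm that state (a sketch; <mongod_pid> is a placeholder): threads blocked in uninterruptible sleep are reported with scheduler state 'D'.

    # list mongod threads with their state; 'D' = uninterruptible sleep
    ps -eLo pid,tid,stat,comm | grep mongod
    # or inspect each thread via /proc
    grep '^State:' /proc/<mongod_pid>/task/*/status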

Comment by Lingzhi Deng [ 15/Dec/21 ]

In this case, the write operation that's blocking stepDown should have been interrupted, right? I wonder if SERVER-61251 would help here too.

Comment by Adi Zaimi [ 10/Dec/21 ]

I put an abort() at the moment we time out getting the lock, and got a core:

Thread 1 (Thread 0x7fbabef14700 (LWP 8005)):
#1  0x00007f62f9a6b921 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x00007f62f4f41bbf in mongo::LockerImpl::_lockComplete (this=0x5614b48ea680, opCtx=0x2710, resId=..., mode=<optimized out>, deadline=...) at src/mongo/db/concurrency/lock_state.cpp:1038
#3  0x00007f62f4f4a0dc in mongo::repl::ReplicationStateTransitionLockGuard::ReplicationStateTransitionLockGuard (this=0x7f62ca1cfae8, opCtx=0x7f62ca1cf130, mode=mongo::MODE_NONE) at src/mongo/db/concurrency/replication_state_transition_lock_guard.cpp:49
#4  0x00007f62f801daf0 in mongo::repl::ReplicationCoordinatorImpl::AutoGetRstlForStepUpStepDown::AutoGetRstlForStepUpStepDown (this=0x7f62ca1cfad0, repl=<optimized out>, opCtx=<optimized out>, stateTransition=<optimized out>, deadline=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:2512
#5  0x00007f62f801e966 in mongo::repl::ReplicationCoordinatorImpl::stepDown (this=0x5614af75e000, opCtx=0x5614b7c20800, force=true, waitTime=..., stepdownTime=...) at src/mongo/db/repl/replication_coordinator_impl.cpp:2650
#6  0x00007f62f3457981 in mongo::repl::CmdReplSetStepDown::run (this=0x7f62f34665e0 <mongo::repl::cmdReplSetStepDown>, opCtx=0x5614b7c20800, cmdObj=..., result=...) at src/mongo/db/repl/repl_set_commands.cpp:585
#7  0x00007f62f86cfc65 in mongo::BasicCommand::runWithReplyBuilder (this=0x7f62f34665e0 <mongo::repl::cmdReplSetStepDown>, opCtx=0x7f62ca1cf130, db=<error reading variable: Cannot access memory at address 0x8>, cmdObj=<error reading variable: Cannot access memory at address 0x108248c8b48>..., replyBuilder=<optimized out>) at src/mongo/db/commands.h:953
#8  0x00007f62f50578d5 in mongo::BasicCommandWithReplyBuilderInterface::Invocation::run (this=<optimized out>, opCtx=0x5614b7c20800, result=0x5614b567c090) at src/mongo/db/commands.cpp:879
#9  0x00007f62f5047130 in mongo::CommandHelpers::runCommandInvocation (opCtx=0x5614b7c20800, request=..., invocation=0x5614b4eeb080, response=0x5614b567c090) at src/mongo/db/commands.cpp:199

The reason attaching gdb may not work is that the process is in a strange state when one or more of its threads are in uninterruptible sleep (which is what happens when we freeze the filesystem underneath).
