Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.0.25
Affects Version/s: 4.0.23
Component/s: Replication
Labels: None
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:

1. Start a 3-node replica set.
2. Start a single-node replica set for the config server.
3. Start a mongos server.
4. Start a workload with multiple client threads (e.g., 50) running a mixture of find/update operations against the primary (through mongos).
5. Freeze the dbpath for the primary (e.g., fsfreeze --freeze /mnt/primary).
6. Ask the primary to step down.
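The last two reproduction steps (freezing the dbpath and requesting a step down) can be sketched as a small helper. This is a minimal sketch, not the reporter's script: the function names are hypothetical, the 60-second step-down period is an assumed value, and the code only builds the `fsfreeze` argument list and the `replSetStepDown` command document without contacting a server.

```python
from typing import Any, Dict, List


def fsfreeze_argv(mount_point: str, freeze: bool = True) -> List[str]:
    """Build the argv that freezes (or unfreezes) the filesystem holding the dbpath."""
    flag = "--freeze" if freeze else "--unfreeze"
    return ["fsfreeze", flag, mount_point]


def step_down_command(step_down_secs: int = 60) -> Dict[str, Any]:
    """Build the replSetStepDown command document.

    The command name has to be the first key; Python dicts keep insertion order.
    The default of 60 seconds is an assumption made for this sketch.
    """
    return {"replSetStepDown": step_down_secs}


if __name__ == "__main__":
    # Mount point taken from the example in the steps above.
    print(fsfreeze_argv("/mnt/primary"))
    print(step_down_command())
```

The argument list can be passed to `subprocess.run` (freezing a filesystem typically requires root), and the command document can be sent to the primary's `admin` database, for example through pymongo's `client.admin.command(...)`. The filesystem should be unfrozen again with `fsfreeze --unfreeze` once the test is done.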
Sprint: Repl 2021-07-12, Repl 2021-07-26, Repl 2021-08-09, Repl 2021-08-23, Replication 2021-11-15, Replication 2021-11-29, Replication 2021-12-13, Replication 2021-12-27, Replication 2022-01-10, Replication 2022-01-24, Replication 2022-02-07
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Sending a step down request to a primary that is experiencing disk failures could result in consistent time-out errors:

{
        "operationTime" : Timestamp(1620337238, 857),
        "ok" : 0,
        "errmsg" : "Could not acquire the global shared lock before the deadline for stepdown",
        "code" : 262,
        "codeName" : "ExceededTimeLimit",
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000001")
        },
        "lastCommittedOpTime" : Timestamp(1620337238, 327),
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1620337306, 1),
                        "t" : NumberLong(1)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1620337306, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

The error is returned from here and the behavior is easy to reproduce. I've tested the behavior on v4.0.23.
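For scripts that drive this reproduction, the failure can be recognized from the fields of the response shown above. The following is a minimal sketch under stated assumptions: the helper name `is_stepdown_lock_timeout` is hypothetical, the response is assumed to be already parsed into a Python dict, and matching on a substring of `errmsg` is an illustrative choice, since the message wording may differ between server versions.

```python
from typing import Any, Mapping

# "code" value from the response above (codeName "ExceededTimeLimit").
EXCEEDED_TIME_LIMIT = 262


def is_stepdown_lock_timeout(response: Mapping[str, Any]) -> bool:
    """Return True if a command response looks like the step-down lock timeout."""
    if response.get("ok") != 0:
        return False
    if response.get("code") != EXCEEDED_TIME_LIMIT:
        return False
    # Assumption: the message keeps this wording; it may vary across versions.
    return "global shared lock" in response.get("errmsg", "")


if __name__ == "__main__":
    sample = {
        "ok": 0,
        "errmsg": "Could not acquire the global shared lock before the deadline for stepdown",
        "code": 262,
        "codeName": "ExceededTimeLimit",
    }
    print(is_stepdown_lock_timeout(sample))
```

Checking the numeric `code` first keeps the test robust if only the message text changes; the substring match then separates this case from other `ExceededTimeLimit` failures.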

Also, I tried to attach GDB to the primary to collect stack traces, but GDB hangs, and I haven't found an alternative yet.

is related to

SERVER-71520 Dump all thread stacks on RSTL acquisition timeout

Closed

related to

SERVER-65766 ShardingStateRecovery makes remote calls to config server while holding the RSTL

Closed

SERVER-65825 Increase fassertOnLockTimeoutForStepUpDown default timeout to 30 seconds

Closed

SERVER-61251 Ensure long running storage engine operations are interruptible

Closed

Assignee: Adi Zaimi
Reporter: Amirsaman Memaripour
Participants: Adi Zaimi, Amirsaman Memaripour, Githook User, Lingzhi Deng
Votes: 0
Watchers: 12

Created: May 07 2021 02:46:28 PM UTC
Updated: Apr 22 2024 05:39:24 PM UTC
Resolved: Jan 24 2022 09:45:05 PM UTC
Confidence Status Last Update: 22/Dec/21 8:44 PM
