[SERVER-35124] Stepdown suites with MMAP V1 often fail due to `flushing mmaps` taking a long time Created: 21/May/18  Updated: 29/Oct/23  Resolved: 23/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 3.6.6, 4.0.1, 4.1.1

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: TIG 2018-07-02
Participants:
Linked BF Score: 15

 Description   

On some occasions, right around the time the ContinuousStepdownThread issued a replSetStepDown command, the MMAP V1 storage engine started flushing the file mappings, which took 13 seconds. Because this flush happens under a lock, the stepdown thread could not acquire the global S lock within 10 seconds and failed.

Here are the relevant logs:

[ContinuousStepdown:job0] Stepping down the primary on port 20005 of replica set 'shard-rs1'.
...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.286+0000 I STORAGE  [conn70] allocating new ns file \data\db\job0\resmoke\shard1\node0\test.ns, filling with zeroes...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.494+0000 I STORAGE  [conn70] done allocating ns file \data\db\job0\resmoke\shard1\node0\test.ns, size: 16MB, took 0.208 seconds
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.495+0000 I STORAGE  [FileAllocator] allocating new datafile \data\db\job0\resmoke\shard1\node0\test.0
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.496+0000 I STORAGE  [FileAllocator] done allocating datafile \data\db\job0\resmoke\shard1\node0\test.0, size: 64MB,  took 0 secs
...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn81] Attempting to step down in response to replSetStepDown command
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I SHARDING [conn70] Failed to refresh databaseVersion for database test :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn70] command test.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "jstests_explain_upsert", indexes: [ { key: { key: 1.0 }, name: "key_1" } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], databaseVersion: { uuid: UUID("c47419a1-e264-48de-99ff-fdfd9f6c0a70"), lastMod: 1 }, allowImplicitCollectionCreation: false, $clusterTime: { clusterTime: Timestamp(1526689051, 9), signature: { hash: BinData(0, FE36970C885D756C5AE540D195D5F97D3CA7BAB4), keyId: 6557079175839088663 } }, $client: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.7.9-343-g6ab1592260" }, os: { type: "Windows", name: "Microsoft Windows Server 2008 R2", architecture: "x86_64", version: "6.1 SP1 (build 7601)" }, mongos: { host: "WIN-E7OL17UNDC0:20007", client: "127.0.0.1:55951", version: "3.7.9-343-g6ab1592260" } }, $configServerState: { opTime: { ts: Timestamp(1526689051, 9), t: 3 } }, $db: "test" } numYields:0 ok:0 errMsg:"don't know dbVersion" errName:StaleDbVersion errCode:249 reslen:474 locks:{ Global: { acquireCount: { r: 3, w: 1 }, acquireWaitCount: { r: 1 } }, MMAPV1Journal: { acquireCount: { w: 5 } }, Database: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { W: 1 } }, Mutex: { acquireCount: { r: 1, W: 4 } } } protocol:op_msg 551ms
...
[ShardedClusterFixture:job0:mongos] 2018-05-19T00:17:31.838+0000 I NETWORK  [TaskExecutorPool-0] Marking host localhost:20005 as failed :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Received response to heartbeat (requestId: 66) from localhost:20006, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "shard-rs1", syncingTo: "localhost:20005", term: 3, primaryId: 0, durableOpTime: { ts: Timestamp(1526689051, 12), t: 3 }, opTime: { ts: Timestamp(1526689051, 12), t: 3 }, operationTime: Timestamp(1526689053, 1), $replData: { term: 3, lastOpCommitted: { ts: Timestamp(1526689043, 1), t: 3 }, lastOpVisible: { ts: Timestamp(1526689043, 1), t: 3 }, configVersion: 2, replicaSetId: ObjectId('5aff6ce0c38a7d74bfa6bff8'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('7fffffff0000000000000002') }, lastCommittedOpTime: Timestamp(1526689043, 1), $configServerState: { opTime: { ts: Timestamp(1526689041, 21), t: 2 } }, $clusterTime: { clusterTime: Timestamp(1526689053, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Scheduling heartbeat to localhost:20006 at 2018-05-19T00:17:43.789Z
...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.836+0000 I COMMAND  [conn81] command admin.$cmd command: replSetStepDown { replSetStepDown: 10, force: true, $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 ok:0 errMsg:"Could not acquire the global shared lock within the amount of time specified that we should step down for" errName:ExceededTimeLimit errCode:50 reslen:488 locks:{ Global: { acquireCount: { r: 1, W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 10000225 } } } protocol:op_query 10000ms
...
[ContinuousStepdown:job0] Error while stepping down the primary on port 20005 of replica set 'shard-rs1'.
...
[ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:43.711+0000 I STORAGE  [DataFileSync] flushing mmaps took 13151ms  for 7 files
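
For context, the stepdown hook drives this with a plain replSetStepDown command against the primary. A minimal pymongo sketch of the failing call (the port and timings are taken from the logs above; the helper is illustrative, not resmoke's actual code):

import pymongo
import pymongo.errors

# Connect to the primary that the stepdown hook targets (port 20005 in the logs).
primary = pymongo.MongoClient("localhost", 20005)

try:
    # Matches the command in the logs: a 10-second stepdown with force=True.
    primary.admin.command({"replSetStepDown": 10, "force": True})
except pymongo.errors.ExecutionTimeout:
    # Server-side ExceededTimeLimit (code 50): the global lock could not be
    # acquired within 10 seconds because MMAPv1 was flushing mmaps under it.
    print("stepdown failed while the primary was flushing mmaps")
except pymongo.errors.AutoReconnect:
    # On success the primary closes its connections, so pymongo raises
    # AutoReconnect; this is the expected "it worked" outcome.
    print("stepdown succeeded")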



 Comments   
Comment by Githook User [ 03/Jul/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35124 Tolerate replSetStepDown failing in stepdown suites.

It is possible for a database operation to prevent the global X lock
from being acquired within 10 seconds. We'll simply retry 8 seconds
later.

(cherry picked from commit d7ed31017007fd5963390247e6ae68714cb6a61c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/393c0da8464d2e3aea20d17168f529c00466d6f3

Comment by Githook User [ 03/Jul/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35124 Tolerate replSetStepDown failing in stepdown suites.

It is possible for a database operation to prevent the global X lock
from being acquired within 10 seconds. We'll simply retry 8 seconds
later.

(cherry picked from commit d7ed31017007fd5963390247e6ae68714cb6a61c)
Branch: v4.0
https://github.com/mongodb/mongo/commit/197a39dd137ad5d8bb1315fd3102a7a5791ae631

Comment by Githook User [ 23/Jun/18 ]

Author:

{'username': 'visemet', 'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com'}

Message: SERVER-35124 Tolerate replSetStepDown failing in stepdown suites.

It is possible for a database operation to prevent the global X lock
from being acquired within 10 seconds. We'll simply retry 8 seconds
later.
Branch: master
https://github.com/mongodb/mongo/commit/d7ed31017007fd5963390247e6ae68714cb6a61c

Comment by Judah Schvimer [ 11/Jun/18 ]

If we know when we want to be electable again, we could use the replSetFreeze: 0 command to make it electable again.
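
A minimal pymongo sketch of that suggestion (connection details are illustrative):

import pymongo

# Connect to the member that just stepped down (port is illustrative).
node = pymongo.MongoClient("localhost", 20005)

# replSetFreeze: 0 clears any remaining freeze period, making the member
# immediately eligible to run for election again.
node.admin.command({"replSetFreeze": 0})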

Comment by Eric Milkie [ 11/Jun/18 ]

You're right, I forgot I already filed a ticket to fix this: SERVER-35442
In my head it was already fixed.

Comment by Max Hirschhorn [ 11/Jun/18 ]

> The quote above is from Kal, and I believe he intended to change the secondaryCatchUpPeriodSecs?

milkie, I don't think so. The linked BF ticket failed due to not being able to acquire the global X lock, which is controlled by the stepDownForSecs value associated with the command name itself.

Comment by Eric Milkie [ 11/Jun/18 ]

I thought the replSetStepDown command took two parameters. The parameter for the command name itself is only the freeze time after stepdown. The parameter named "secondaryCatchUpPeriodSecs" is optional and if not provided, defaults to 10 seconds.
The quote above is from Kal, and I believe he intended to change the secondaryCatchUpPeriodSecs?
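
For illustration, the two knobs Eric describes look like this in pymongo (values and connection details are illustrative):

from pymongo import MongoClient

client = MongoClient("localhost", 20005)

# The command value (30) is the freeze/step-down period after stepping down;
# secondaryCatchUpPeriodSecs is optional and defaults to 10 seconds.
client.admin.command({"replSetStepDown": 30, "secondaryCatchUpPeriodSecs": 10})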

Comment by Max Hirschhorn [ 09/Jun/18 ]

> I have seen a few of those happening and it is just a slow machine. Max Hirschhorn how about we increase the stepdown thread timeout to 30 seconds for MMAP V1?

The downside of changing to run the {replSetStepDown: 30} command instead of the {replSetStepDown: 10} command is that while the primary will wait up to 30 seconds to acquire the global X lock, it will also be unelectable for 30 seconds after stepping down. While the Python version of the StepdownThread uses the "replSetStepUp" command to restore write availability more quickly, it is best-effort and may not succeed.

I'd rather change the StepdownThread to tolerate the ExceededTimeLimit error response from the "replSetStepDown" command by swallowing the pymongo.errors.ExecutionTimeout exception. CC judah.schvimer, spencer
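
A minimal sketch of what tolerating that error could look like in the stepdown thread (the helper name is hypothetical, not the actual resmoke change; the 8-second retry delay matches the commit messages above):

import time

import pymongo
import pymongo.errors

def try_step_down(primary_client, stepdown_secs=10, retry_delay_secs=8):
    # Ask the primary to step down; tolerate it being unable to acquire the
    # global lock in time (e.g. MMAPv1 busy flushing mmaps).
    try:
        primary_client.admin.command(
            {"replSetStepDown": stepdown_secs, "force": True})
    except pymongo.errors.ExecutionTimeout:
        # Server returned ExceededTimeLimit; skip this round and retry later.
        time.sleep(retry_delay_secs)
    except pymongo.errors.AutoReconnect:
        # A successful stepdown closes connections, so this is not an error.
        pass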
