Core Server / SERVER-35124

Stepdown suites with MMAPv1 often fail due to `flushing mmaps` taking a long time


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6.6, 4.0.1, 4.1.1
    • Component/s: Replication, Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.0, v3.6
    • Sprint:
      TIG 2018-07-02
    • Linked BF Score:
      15

      Description

      On some occasions, right around the time the ContinuousStepdown thread issued a replSetStepDown command, the MMAPv1 storage engine started flushing its file mappings, which took 13 seconds. Because the flush holds a lock, the stepdown thread could not acquire the global S lock within its 10-second limit, and the command failed.
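
      For illustration, here is a minimal sketch of how a stepdown driver could tolerate this failure mode by retrying replSetStepDown when the primary reports ExceededTimeLimit (errCode 50 in the logs below). This is not the actual resmoke hook; the helper name and retry policy are assumptions:

      # Hypothetical sketch, not the actual ContinuousStepdown hook: retry
      # replSetStepDown when the primary cannot acquire the global S lock in
      # time (e.g. while MMAPv1 is flushing its file mappings).
      from pymongo import MongoClient
      from pymongo.errors import AutoReconnect, OperationFailure

      def step_down_with_retry(host, port, step_down_secs=10, attempts=3):
          client = MongoClient(host, port)
          for _ in range(attempts):
              try:
                  # Sends { replSetStepDown: 10, force: true }, as in the logs.
                  client.admin.command("replSetStepDown", step_down_secs, force=True)
                  return True
              except AutoReconnect:
                  # The primary closes connections after stepping down, so a
                  # network error here actually signals success.
                  return True
              except OperationFailure as err:
                  if err.code == 50:   # ExceededTimeLimit: lock wait timed out
                      continue         # the flush should finish shortly; retry
                  raise
          return False

      In this particular incident a single retry would have sufficed: the flush completed about two seconds after the stepdown command failed.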

      Here are the relevant logs:

      [ContinuousStepdown:job0] Stepping down the primary on port 20005 of replica set 'shard-rs1'.
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.286+0000 I STORAGE  [conn70] allocating new ns file \data\db\job0\resmoke\shard1\node0\test.ns, filling with zeroes...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.494+0000 I STORAGE  [conn70] done allocating ns file \data\db\job0\resmoke\shard1\node0\test.ns, size: 16MB, took 0.208 seconds
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.495+0000 I STORAGE  [FileAllocator] allocating new datafile \data\db\job0\resmoke\shard1\node0\test.0
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.496+0000 I STORAGE  [FileAllocator] done allocating datafile \data\db\job0\resmoke\shard1\node0\test.0, size: 64MB,  took 0 secs
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn81] Attempting to step down in response to replSetStepDown command
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I SHARDING [conn70] Failed to refresh databaseVersion for database test :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn70] command test.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "jstests_explain_upsert", indexes: [ { key: { key: 1.0 }, name: "key_1" } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], databaseVersion: { uuid: UUID("c47419a1-e264-48de-99ff-fdfd9f6c0a70"), lastMod: 1 }, allowImplicitCollectionCreation: false, $clusterTime: { clusterTime: Timestamp(1526689051, 9), signature: { hash: BinData(0, FE36970C885D756C5AE540D195D5F97D3CA7BAB4), keyId: 6557079175839088663 } }, $client: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.7.9-343-g6ab1592260" }, os: { type: "Windows", name: "Microsoft Windows Server 2008 R2", architecture: "x86_64", version: "6.1 SP1 (build 7601)" }, mongos: { host: "WIN-E7OL17UNDC0:20007", client: "127.0.0.1:55951", version: "3.7.9-343-g6ab1592260" } }, $configServerState: { opTime: { ts: Timestamp(1526689051, 9), t: 3 } }, $db: "test" } numYields:0 ok:0 errMsg:"don't know dbVersion" errName:StaleDbVersion errCode:249 reslen:474 locks:{ Global: { acquireCount: { r: 3, w: 1 }, acquireWaitCount: { r: 1 } }, MMAPV1Journal: { acquireCount: { w: 5 } }, Database: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { W: 1 } }, Mutex: { acquireCount: { r: 1, W: 4 } } } protocol:op_msg 551ms
      ...
      [ShardedClusterFixture:job0:mongos] 2018-05-19T00:17:31.838+0000 I NETWORK  [TaskExecutorPool-0] Marking host localhost:20005 as failed :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Received response to heartbeat (requestId: 66) from localhost:20006, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "shard-rs1", syncingTo: "localhost:20005", term: 3, primaryId: 0, durableOpTime: { ts: Timestamp(1526689051, 12), t: 3 }, opTime: { ts: Timestamp(1526689051, 12), t: 3 }, operationTime: Timestamp(1526689053, 1), $replData: { term: 3, lastOpCommitted: { ts: Timestamp(1526689043, 1), t: 3 }, lastOpVisible: { ts: Timestamp(1526689043, 1), t: 3 }, configVersion: 2, replicaSetId: ObjectId('5aff6ce0c38a7d74bfa6bff8'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('7fffffff0000000000000002') }, lastCommittedOpTime: Timestamp(1526689043, 1), $configServerState: { opTime: { ts: Timestamp(1526689041, 21), t: 2 } }, $clusterTime: { clusterTime: Timestamp(1526689053, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Scheduling heartbeat to localhost:20006 at 2018-05-19T00:17:43.789Z
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.836+0000 I COMMAND  [conn81] command admin.$cmd command: replSetStepDown { replSetStepDown: 10, force: true, $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 ok:0 errMsg:"Could not acquire the global shared lock within the amount of time specified that we should step down for" errName:ExceededTimeLimit errCode:50 reslen:488 locks:{ Global: { acquireCount: { r: 1, W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 10000225 } } } protocol:op_query 10000ms
      ...
      [ContinuousStepdown:job0] Error while stepping down the primary on port 20005 of replica set 'shard-rs1'.
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:43.711+0000 I STORAGE  [DataFileSync] flushing mmaps took 13151ms  for 7 files
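
      The timestamps bear this out: the flush finished at 00:17:43.711 after 13151 ms, so it began around 00:17:30.560, before the replSetStepDown command at 00:17:31.837, and it spanned the entire 10-second lock wait (timeAcquiringMicros: 10000225). A quick check of the arithmetic:

      # Back-of-the-envelope check of the timeline, using timestamps copied
      # from the log lines above.
      from datetime import datetime, timedelta

      fmt = "%Y-%m-%dT%H:%M:%S.%f%z"
      flush_end = datetime.strptime("2018-05-19T00:17:43.711+0000", fmt)
      flush_start = flush_end - timedelta(milliseconds=13151)  # "took 13151ms"
      stepdown = datetime.strptime("2018-05-19T00:17:31.837+0000", fmt)
      failed = datetime.strptime("2018-05-19T00:17:41.836+0000", fmt)

      print(flush_start.isoformat())  # 2018-05-19T00:17:30.560000+00:00 -- before the stepdown
      print(failed - stepdown)        # 0:00:09.999000 -- the ~10 s lock wait before ExceededTimeLimit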
      
