Core Server / SERVER-35124

Stepdown suites with MMAP V1 often fail due to `flushing mmaps` taking a long time

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Backport Requested: v4.0, v3.6
    • Sprint: TIG 2018-07-02

      On some occasions, right around the time the ContinuousStepdown thread issues a replSetStepDown command, the MMAP V1 storage engine starts flushing the file mappings; in the failure below the flush took over 13 seconds. Because the flush happens under a lock, the stepdown thread cannot acquire the global S lock within the 10-second limit, and the command fails with ExceededTimeLimit.
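
      For context, the stepdown the hook performs amounts to roughly the following. This is a minimal pymongo sketch of the command seen in the log, not the actual resmoke ContinuousStepdown implementation; the connection details and error handling are illustrative:

          import pymongo
          from pymongo.errors import AutoReconnect, OperationFailure

          EXCEEDED_TIME_LIMIT = 50  # errCode of the failed stepdown in the log below

          # Talk to the current primary directly (port 20005 in the log below);
          # directConnection requires pymongo >= 3.12.
          client = pymongo.MongoClient("localhost", 20005, directConnection=True)
          try:
              # Step down for 10 seconds. Even with force=True the server must
              # still acquire the global S lock within that window.
              client.admin.command("replSetStepDown", 10, force=True)
          except AutoReconnect:
              # Expected on success: the stepping-down primary closes its connections.
              pass
          except OperationFailure as exc:
              if exc.code == EXCEEDED_TIME_LIMIT:
                  print("stepdown failed: could not get the global S lock in time")
              raise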

      Here are the relevant logs:

      [ContinuousStepdown:job0] Stepping down the primary on port 20005 of replica set 'shard-rs1'.
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.286+0000 I STORAGE  [conn70] allocating new ns file \data\db\job0\resmoke\shard1\node0\test.ns, filling with zeroes...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.494+0000 I STORAGE  [conn70] done allocating ns file \data\db\job0\resmoke\shard1\node0\test.ns, size: 16MB, took 0.208 seconds
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.495+0000 I STORAGE  [FileAllocator] allocating new datafile \data\db\job0\resmoke\shard1\node0\test.0
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.496+0000 I STORAGE  [FileAllocator] done allocating datafile \data\db\job0\resmoke\shard1\node0\test.0, size: 64MB,  took 0 secs
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn81] Attempting to step down in response to replSetStepDown command
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I SHARDING [conn70] Failed to refresh databaseVersion for database test :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:31.837+0000 I COMMAND  [conn70] command test.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "jstests_explain_upsert", indexes: [ { key: { key: 1.0 }, name: "key_1" } ], shardVersion: [ Timestamp(0, 0), ObjectId('000000000000000000000000') ], databaseVersion: { uuid: UUID("c47419a1-e264-48de-99ff-fdfd9f6c0a70"), lastMod: 1 }, allowImplicitCollectionCreation: false, $clusterTime: { clusterTime: Timestamp(1526689051, 9), signature: { hash: BinData(0, FE36970C885D756C5AE540D195D5F97D3CA7BAB4), keyId: 6557079175839088663 } }, $client: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.7.9-343-g6ab1592260" }, os: { type: "Windows", name: "Microsoft Windows Server 2008 R2", architecture: "x86_64", version: "6.1 SP1 (build 7601)" }, mongos: { host: "WIN-E7OL17UNDC0:20007", client: "127.0.0.1:55951", version: "3.7.9-343-g6ab1592260" } }, $configServerState: { opTime: { ts: Timestamp(1526689051, 9), t: 3 } }, $db: "test" } numYields:0 ok:0 errMsg:"don't know dbVersion" errName:StaleDbVersion errCode:249 reslen:474 locks:{ Global: { acquireCount: { r: 3, w: 1 }, acquireWaitCount: { r: 1 } }, MMAPV1Journal: { acquireCount: { w: 5 } }, Database: { acquireCount: { W: 1 } }, Metadata: { acquireCount: { W: 1 } }, Mutex: { acquireCount: { r: 1, W: 4 } } } protocol:op_msg 551ms
      ...
      [ShardedClusterFixture:job0:mongos] 2018-05-19T00:17:31.838+0000 I NETWORK  [TaskExecutorPool-0] Marking host localhost:20005 as failed :: caused by :: InterruptedDueToReplStateChange: operation was interrupted
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Received response to heartbeat (requestId: 66) from localhost:20006, { ok: 1.0, state: 2, v: 2, hbmsg: "", set: "shard-rs1", syncingTo: "localhost:20005", term: 3, primaryId: 0, durableOpTime: { ts: Timestamp(1526689051, 12), t: 3 }, opTime: { ts: Timestamp(1526689051, 12), t: 3 }, operationTime: Timestamp(1526689053, 1), $replData: { term: 3, lastOpCommitted: { ts: Timestamp(1526689043, 1), t: 3 }, lastOpVisible: { ts: Timestamp(1526689043, 1), t: 3 }, configVersion: 2, replicaSetId: ObjectId('5aff6ce0c38a7d74bfa6bff8'), primaryIndex: 0, syncSourceIndex: 0 }, $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('7fffffff0000000000000002') }, lastCommittedOpTime: Timestamp(1526689043, 1), $configServerState: { opTime: { ts: Timestamp(1526689041, 21), t: 2 } }, $clusterTime: { clusterTime: Timestamp(1526689053, 1), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.789+0000 D REPL_HB  [replexec-2] Scheduling heartbeat to localhost:20006 at 2018-05-19T00:17:43.789Z
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:41.836+0000 I COMMAND  [conn81] command admin.$cmd command: replSetStepDown { replSetStepDown: 10, force: true, $readPreference: { mode: "secondaryPreferred" }, $db: "admin" } numYields:0 ok:0 errMsg:"Could not acquire the global shared lock within the amount of time specified that we should step down for" errName:ExceededTimeLimit errCode:50 reslen:488 locks:{ Global: { acquireCount: { r: 1, W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 10000225 } } } protocol:op_query 10000ms
      ...
      [ContinuousStepdown:job0] Error while stepping down the primary on port 20005 of replica set 'shard-rs1'.
      ...
      [ShardedClusterFixture:job0:shard1:node0] 2018-05-19T00:17:43.711+0000 I STORAGE  [DataFileSync] flushing mmaps took 13151ms  for 7 files
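
      The "flushing mmaps took ...ms" line reports the duration of each flush, so failures of this kind can be triaged by scanning the mongod log for flushes longer than the stepdown window. Below is a small illustrative Python filter; the regex follows the log format above, while the 10-second threshold (taken from replSetStepDown: 10) is an assumption about the suite's configuration:

          import re
          import sys

          # Matches e.g. "flushing mmaps took 13151ms  for 7 files".
          FLUSH_RE = re.compile(r"flushing mmaps took (\d+)ms\s+for (\d+) files")
          STEPDOWN_TIMEOUT_MS = 10 * 1000  # replSetStepDown: 10 seconds

          for line in sys.stdin:
              m = FLUSH_RE.search(line)
              if m and int(m.group(1)) > STEPDOWN_TIMEOUT_MS:
                  # A flush longer than the stepdown window can hold off the
                  # global S lock acquisition and fail the stepdown command.
                  print(f"slow flush: {m.group(1)}ms for {m.group(2)} files")

      Applied to the log above, this flags the 13151ms flush which, counting back from its 00:17:43.711 completion, began around 00:17:30.6 and therefore spans the entire failed stepdown attempt (00:17:31.8 to 00:17:41.8).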
      

            Assignee: Max Hirschhorn (max.hirschhorn@mongodb.com)
            Reporter: Kaloian Manassiev (kaloian.manassiev@mongodb.com)
            Votes: 0
            Watchers: 6
