Shard restore procedure leaves cluster in a hung or inconsistent state.


    • Type: Bug
    • Resolution: Community Answered
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None

      shard config

      #where to log
      logpath = /var/log/mongo/mongod.log
      logappend = true
      dbpath = /dbdata/db
      profile = 1
      # location of pidfile
      pidfilepath = /var/run/mongodb/mongod.pid
      storageEngine = wiredTiger
       
      directoryperdb = False
      port = 27018
      bind_ip_all = True
      shardsvr = true
      replSet = analyticsshard1ReplSet-stage
      wiredTigerEngineConfigString=session_max=40000
      
      # MongoDB Security Options
      transitionToAuth = true
      keyFile = /var/lib/mongo/secrets/keyfile

      configdb config

      #where to log
      logpath = /var/log/mongo/mongod.log
      logappend = true
      dbpath = /dbdata/db
      profile = 1
      # location of pidfile
      pidfilepath = /var/run/mongodb/mongod.pid
      storageEngine = wiredTiger
      directoryperdb = False
      port = 27020
      bind_ip_all = True
      configsvr = true
      replSet = analyticsConfigdbReplSet-stage
      wiredTigerEngineConfigString=session_max=40000
      
      # MongoDB Security Options
      transitionToAuth = true
      keyFile = /var/lib/mongo/secrets/keyfile
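
      To confirm what each restored node actually came up with (role, replica set name, member state), a quick check from a direct mongo shell connection is possible; a minimal sketch using only standard admin commands:

      // Connect the mongo shell directly to a restored shard member (27018) or config member (27020).
      var opts = db.adminCommand({ getCmdLineOpts: 1 });   // how the process was actually started
      printjson(opts.parsed);                              // confirm shardsvr/configsvr role and replSet name
      var st = db.adminCommand({ replSetGetStatus: 1 });
      print(st.set + ": " + st.members.map(function (m) { return m.name + "=" + m.stateStr; }).join(", "));
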
      1. Stop mongoD on all shards and the configdb.
      2. Snapshot the data directory on the original cluster (this was done with a GCP disk snapshot).
      3. Restore the snapshotted data directories to the mountpoint.
      4. Run through the restore procedure in the MongoDB docs (v3.6).

      Note: The original cluster uses replica sets (3 members per shard and 3 config servers).

      The restored cluster has one member per replica set, each in the PRIMARY state (a sketch of a post-restore metadata check follows).
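
      A sketch of the post-restore metadata check mentioned above, run against the restored config server (port 27020). The shard registry that mongoS reads lives in config.shards; if I recall the v3.6 restore docs correctly, these entries must point at the restored shard hosts whenever the hostnames changed. The shard _id and hostname in the commented update are placeholders, not values from this report:

      var cfg = db.getSiblingDB("config");
      cfg.shards.find().forEach(printjson);   // each doc: { _id: <shardName>, host: "<replSetName>/<host:port>", ... }

      // Only if the restored shard members run on different hostnames than the snapshot source:
      // cfg.shards.update(
      //     { _id: "analyticsshard1" },   // placeholder shard name
      //     { $set: { host: "analyticsshard1ReplSet-stage/NEW_SHARD_HOST:27018" } }
      // );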


      CentOS Linux release 7.8.2003 (Core)

      MongoD version 3.6.16

      mongoS version 3.6.16

      After restoring the shards from snapshots, mongoS cannot pull data consistently.

      Shard info can be pulled via the sh.status() command through mongoS.
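
      Since sh.status() works, the same config metadata can be read through mongoS with an explicit time limit to see which collections respond and which hit the 30 s ceiling; a small sketch (the find/maxTimeMS form is standard, and the 30000 ms value simply mirrors the timeout reported below):

      // Run via mongoS.
      printjson(db.getSiblingDB("config").runCommand({ find: "shards", maxTimeMS: 30000 }));
      printjson(db.getSiblingDB("config").runCommand({ find: "databases", maxTimeMS: 30000 }));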

      Running simple commands like "show collections" hangs and then fails after the 30 s timeout with "NetworkInterfaceExceededTimeLimit".
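
      "show collections" in the shell is backed by the listCollections command, so the failure can be reproduced as a plain command to capture the full error document; "REDACTED_DB" below is a placeholder for the database name redacted in the logs:

      // Run via mongoS; per the report this fails after ~30 s with NetworkInterfaceExceededTimeLimit.
      var res = db.getSiblingDB("REDACTED_DB").runCommand({ listCollections: 1, maxTimeMS: 30000 });
      printjson(res);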

      All ports are reachable from the shards to the configdb and from the configdb to the shards.
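
      Reachability can also be checked at the MongoDB protocol level rather than just the TCP port, e.g. from the shell on a shard host (CONFIGDB_HOSTNAME mirrors the placeholder used in the shard log below; with transitionToAuth these commands should not require credentials, though that is an assumption):

      var conn = new Mongo("CONFIGDB_HOSTNAME:27020");
      printjson(conn.getDB("admin").runCommand({ ping: 1 }));      // { ok: 1 } if the node answers
      printjson(conn.getDB("admin").runCommand({ isMaster: 1 }));  // setName plus whether it sees itself as primary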

      Messages of this type are seen in one of the shard logs:

      2020-09-02T15:11:28.624+0000 I NETWORK  [shard registry reload] Marking host CONFIGDB_HOSTNAME:27020 as failed :: caused by :: NetworkInterfaceExceededTimeLimit: Operation timed out
      2020-09-02T15:11:28.624+0000 I SHARDING [shard registry reload] Operation timed out  :: caused by :: NetworkInterfaceExceededTimeLimit: Operation timed out
      2020-09-02T15:11:28.625+0000 I SHARDING [shard registry reload] Periodic reload of shard registry failed  :: caused by :: NetworkInterfaceExceededTimeLimit: could not get updated shard list from config server due to Operation timed out; will retry after 30
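
      The shard registry reload that is timing out here is essentially a read of config.shards from the config server with majority read concern (the configdb entry below shows the analogous read of config.databases). A sketch of issuing the same kind of read directly on the restored config server, to separate "majority reads stall" from "the network path stalls":

      // Run directly against the restored config server (port 27020).
      printjson(db.getSiblingDB("config").runCommand({
          find: "shards",
          readConcern: { level: "majority" },
          maxTimeMS: 30000
      }));
      // For comparison, the same read with the default ("local") read concern:
      printjson(db.getSiblingDB("config").runCommand({ find: "shards", maxTimeMS: 30000 }));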

      Messages of this type are seen in the configdb log:

       2020-09-02T15:12:51.600+0000 I COMMAND  [conn20665] Command on database config timed out waiting for read concern to be satisfied. Command: { find: "databases", filter: { _id: "REDACTED" }, readConcern: { level: "majority", afterOpTime: { ts: Timestamp(1598990710, 1), t: 6 } }, maxTimeMS: 30000, $readPreference: { mode: "nearest" }, $replData: 1, $clusterTime: { clusterTime: Timestamp(1599059535, 1), signature: { hash: BinData(0, CBA1C41E88C09DB4E41C843D8F384811DF5ACA90), keyId: 6846969707174559784 } }, $configServerState: { opTime: { ts: Timestamp(1598990710, 1), t: 6 } }, $db: "config" }. Info: ExceededTimeLimit: Error waiting for snapshot not less than { ts: Timestamp(1598990710, 1), t: 6 }, current relevant optime is { ts: Timestamp(1599059563, 1), t: 5 }. :: caused by :: operation exceeded time limit
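
      The notable detail in this entry is that the requested afterOpTime carries term 6 ({ ts: Timestamp(1598990710, 1), t: 6 }) while the config server reports its current optime with term 5, which suggests the majority-read wait cannot complete until the restored single-member set reaches a comparable optime. A small sketch for reading the config server's current applied optime and term (this is a reading of the log, not a confirmed root cause):

      // Run on the restored config server primary.
      var st = db.adminCommand({ replSetGetStatus: 1 });
      printjson(st.optimes.appliedOpTime);   // { ts: Timestamp(...), t: <current term> }
      printjson(st.optimes.durableOpTime);
      print("current term: " + st.term);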

       

      Sometimes restarting the configdb once or twice fixes the issue and mongoS starts pulling data again, but this is very inconsistent: restarting the configdb only works some of the time.
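
      As a less disruptive comparison point than bouncing the configdb, the routing-table cache on each mongoS can be dropped explicitly; whether this helps in this scenario is untested, so treat it purely as a diagnostic step (on 3.6, flushRouterConfig is run against mongoS, not the shard members):

      // Run against each mongoS.
      db.adminCommand({ flushRouterConfig: 1 });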

       

            Assignee:
            Dmitry Agranat
            Reporter:
            Todd Vernick
            Votes:
            0
            Watchers:
            7

              Created:
              Updated:
              Resolved: