Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-24234

one primary mongod of the cluster out of service occasionally

    XMLWordPrintableJSON

Details

    • Icon: Bug Bug
    • Resolution: Done
    • Icon: Major - P3 Major - P3
    • None
    • None
    • Admin
    • None
    • ALL

    Description

      Cluster has 5 shard. A total of collection 3000+
      Global environment
      system: centos6.6
      shard 1 primary 2 secondary 2 arb
      mongos 3.0.8
      mongod 3.0.8 wt engine snapp compress 1 primary 2 secondary 2 arb
      mongo-configure 3.0.8 mmap engine

      At the time point around 2016-05-20T20:08:36.657, it seems that the primary mongod of one shard entered a unstable situation. It blocked
      almost all requests from the client for about 10 seconds (BTW, we're using mongo-java-driver 3.2.0).

      And we also tried to do some simple log analysis. We tried to count each command for every seconds. (Usage is 'command=count')
      We're not sure but we guess there is something wrong with our replcaSet.

      Fri May 20 20:08:35 CST 2016   {count=17, delete=1, findAndModify=202, insert=335, isMaster=4, logout=1, ping=1, replSetHeartbeat=3, replSetUpdatePosition=247, saslContinue=2, saslStart=1, serverStatus=1, setShardVersion=2, top=1, update=3}
      Fri May 20 20:08:36 CST 2016   {count=19, delete=1, findAndModify=137, insert=178, isMaster=2, logout=2, replSetHeartbeat=1, replSetUpdatePosition=171, saslContinue=4, saslStart=2, serverStatus=1, setShardVersion=28, update=3}
      Fri May 20 20:08:37 CST 2016   {isMaster=1, replSetHeartbeat=3, setShardVersion=159}
      Fri May 20 20:08:38 CST 2016   {isMaster=3, replSetHeartbeat=1, saslContinue=196, saslStart=98, setShardVersion=216}
      Fri May 20 20:08:39 CST 2016   {ping=1, replSetHeartbeat=3}
      Fri May 20 20:08:40 CST 2016   {isMaster=2, ping=1, replSetHeartbeat=1}
      Fri May 20 20:08:41 CST 2016   {isMaster=1, replSetHeartbeat=3}
      Fri May 20 20:08:42 CST 2016   {isMaster=1, replSetHeartbeat=1}
      Fri May 20 20:08:43 CST 2016   {isMaster=20, replSetHeartbeat=3}
      Fri May 20 20:08:44 CST 2016   {isMaster=180, ping=1, replSetHeartbeat=1}
      Fri May 20 20:08:45 CST 2016   {isMaster=222, ping=1, replSetHeartbeat=3}
      Fri May 20 20:08:46 CST 2016   {isMaster=175, replSetHeartbeat=1}
      Fri May 20 20:08:47 CST 2016   {isMaster=161, replSetHeartbeat=3}
      Fri May 20 20:08:48 CST 2016   {attempted=1, count=108, delete=7, findAndModify=476, insert=827, isMaster=199, replSetHeartbeat=1, replSetUpdatePosition=122, saslContinue=310, saslStart=609, serverStatus=1, setShardVersion=934, top=1, update=14}
      Fri May 20 20:08:49 CST 2016   {count=133, delete=10, findAndModify=503, insert=537, isMaster=387, logout=1, ping=1, replSetHeartbeat=3, replSetUpdatePosition=204, saslContinue=2049, saslStart=1117, serverStatus=1, setShardVersion=3790, update=16}
      

      We're expecting for your help. Thanks in advance.And we are also wondering to know more details of 'setShardVersion' command. What's the purpose of this command, and when will be triggered?

      The following are raw messages from log file:

      2016-05-20T20:08:36.647+0800 I QUERY    [conn36825] query xxx.aaa query: {.... } 2ms
      2016-05-20T20:08:36.648+0800 I COMMAND  [conn33760] command xxx.$cmd command: findAndModify { .....} 0ms
      2016-05-20T20:08:36.648+0800 I QUERY    [conn162] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:169950952477 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:2 reslen:879 locks:{ Global: { acquireCount: { r: 6 } }, Database: { acquireCount: { r: 3 } }, oplog: { acquireCount: { r: 3 } } } 8ms
      2016-05-20T20:08:36.649+0800 I QUERY    [conn158] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:168949710400 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:302 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } 0ms
      2016-05-20T20:08:36.650+0800 I COMMAND  [conn163] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|302, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
      2016-05-20T20:08:36.650+0800 I COMMAND  [conn163] command xxx.aaa command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|303, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
      2016-05-20T20:08:36.650+0800 I QUERY    [conn36825]   query ....... 0ms
      2016-05-20T20:08:36.655+0800 I QUERY    [conn36825]   query ....... 0ms
      2016-05-20T20:08:36.655+0800 I QUERY    [conn23356]   query ....... 0ms
      2016-05-20T20:08:36.655+0800 I WRITE    [conn26584]   insert ...... 0ms
      2016-05-20T20:08:36.656+0800 I COMMAND  [conn26584] command bbb.xxx command: insert ... 0ms
      2016-05-20T20:08:36.656+0800 I QUERY    [conn162] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:169950952477 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:513 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } 4ms
      2016-05-20T20:08:36.656+0800 I QUERY    [conn158] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1463585585000|3 } } cursorid:168949710400 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:513 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, oplog: { acquireCount: { r: 2 } } } 4ms
      2016-05-20T20:08:36.656+0800 I QUERY    [conn33760] query bbb.xxx query: { userId: 28344562 } planSummary: IXSCAN { userId: 1.0 } ntoreturn:0 ntoskip:0 nscanned:5 nscannedObjects:5 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:5 reslen:1985 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 1ms
      2016-05-20T20:08:36.657+0800 I COMMAND  [conn163] command admin.$cmd command: replSetUpdatePosition { replSetUpdatePosition: 1, optimes: [ { _id: ObjectId('565f116c85ef5155d15a0d14'), optime: Timestamp 1463746070000|112, memberId: 0, cfgver: 7, config: { _id: 0, host: "10.6.7.115:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('5720be08143133e67c0fb1a8'), optime: Timestamp 1463585585000|3, memberId: 1, cfgver: 7, config: { _id: 1, host: "10.6.7.116:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } }, { _id: ObjectId('56dd435072b07953b7619bcc'), optime: Timestamp 1463746116000|304, memberId: 5, cfgver: 7, config: { _id: 5, host: "10.6.7.129:47003", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } } ] } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
      2016-05-20T20:08:36.747+0800 I COMMAND  [conn34681] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|14, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.749+0800 I COMMAND  [conn25365] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|14, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:289 locks:{} 0ms
      2016-05-20T20:08:36.749+0800 I COMMAND  [conn25816] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.751+0800 I NETWORK  [conn25365] end connection 10.6.7.111:39415 (1265 connections now open)
      2016-05-20T20:08:36.753+0800 I COMMAND  [conn36650] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.767+0800 I COMMAND  [conn31145] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.776+0800 I COMMAND  [conn349] command admin.$cmd command: setShardVersion { setShardVersion: "yyy.cccc", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.776+0800 I COMMAND  [conn34452] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.793+0800 I COMMAND  [conn36680] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.807+0800 I COMMAND  [conn36643] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.808+0800 I COMMAND  [conn34662] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.823+0800 I COMMAND  [conn8764] command admin.$cmd command: logout { logout: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:97 locks:{} 0ms
      2016-05-20T20:08:36.823+0800 I COMMAND  [conn8764] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:539 locks:{} 0ms
      2016-05-20T20:08:36.826+0800 I COMMAND  [conn31256] command admin.$cmd command: setShardVersion { setShardVersion: "vox-exam.afentiWrongQuestionLibrary", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.838+0800 I COMMAND  [conn19766] command admin.$cmd command: setShardVersion { setShardVersion: "vox-exam.afentiWrongQuestionLibrary", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 644000|682, versionEpoch: ObjectId('55e46bd095f6e6a11b57c560') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.853+0800 I NETWORK  [conn50144] end connection 10.6.7.115:46528 (1264 connections now open)
      2016-05-20T20:08:36.854+0800 I NETWORK  [initandlisten] connection accepted from 10.6.7.115:46532 #50148 (1265 connections now open)
      2016-05-20T20:08:36.878+0800 I COMMAND  [conn25438] command admin.$cmd command: setShardVersion { setShardVersion: "xxx.bbbb", configdb: "10.6.7.112:27001,10.6.7.115:27001,10.6.7.118:27001", shard: "47003", shardHost: "47003/10.6.7.115:47003,10.6.7.116:47003,10.6.7.129:47003", version: Timestamp 5000|26, versionEpoch: ObjectId('573cc54cc19de8acaafe1323') } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:146 locks:{} 0ms
      2016-05-20T20:08:36.880+0800 I COMMAND  [conn50148] command local.$cmd command: saslStart { saslStart: 1, mechanism: "SCRAM-SHA-1", payload: "xxx" } ntoreturn:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:239 locks:{} 22ms
      

      Attachments

        1. a.log
          12 kB
        2. aa.log.tgz
          205 kB
        3. QQ图片20160520230528.png
          QQ图片20160520230528.png
          17 kB

        Activity

          People

            Unassigned Unassigned
            liang.zhang@17zuoye.com ufo ofu
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: